Max Zwiessele

01/10/2022, 10:54 AM
I'm running locally and I'm trying to install pytorch, but during the installation in the environments the container is killed at the downloading stage. Did anyone get pytorch installed into an environment before?

Jacopo

01/10/2022, 10:56 AM
Hi Max, thanks for reporting the issue. I have installed pytorch on multiple occasions, so this is new to me. I wonder if this might be related to GPU pass-through.

Rick Lamers

01/10/2022, 10:58 AM
The installation phase shouldn't be affected by Docker GPU pass-through. Did you check disk availability? PyTorch takes quite some space in extracted form (post install).
Thanks for reporting the issue by the way šŸ™‚
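For reference, a quick way to check whether disk space could be the bottleneck (a generic sketch, not Orchest-specific; /var/lib/docker is the default Docker data directory on Linux and may differ on your setup):

    # free space on the filesystem backing Docker
    df -h /var/lib/docker

    # how much space images, containers and the build cache are using
    docker system df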

Jacopo

01/10/2022, 11:01 AM
Could you:
• try to build the environment again
• run docker exec celery-worker cat celery_builds.log
and report the result here?

Max Zwiessele

01/10/2022, 11:04 AM
Given no hashes to check 21 links for project 'torch': discarding no candidates
Using version 1.10.1 (newest of versions: 0.4.1, 0.4.1.post2, 1.0.0, 1.0.1, 1.0.1.post2, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1)
Collecting torch
  Created temporary directory: /tmp/pip-unpack-jja34bo0
  Looking up "https://files.pythonhosted.org/packages/20/8a/c1e970cf64a1fa105bc5064b353ecabe77974b69029a80d04580fee38d5f/torch-1.10.1-cp37-cp37m-manylinux1_x86_64.whl" in the cache
  No cache entry available

[2022-01-10 10:25:26,217: INFO/ForkPoolWorker-6] output:   Starting new HTTPS connection (1): files.pythonhosted.org:443

[2022-01-10 10:25:26,242: INFO/ForkPoolWorker-6] output:   https://files.pythonhosted.org:443 "GET /packages/20/8a/c1e970cf64a1fa105bc5064b353ecabe77974b69029a80d04580fee38d5f/torch-1.10.1-cp37-cp37m-manylinux1_x86_64.whl HTTP/1.1" 200 881907340
  Downloading torch-1.10.1-cp37-cp37m-manylinux1_x86_64.whl (881.9 MB)

[2022-01-10 10:26:19,096: INFO/MainProcess] missed heartbeat from celery@worker-interactive
[2022-01-10 10:26:37,643: INFO/ForkPoolWorker-6] output: Removing intermediate container 028e028af174

[2022-01-10 10:26:37,679: INFO/ForkPoolWorker-6] output:  ---> Running in c48f2f9ada62

[2022-01-10 10:26:37,890: INFO/ForkPoolWorker-6] output: Successfully built a088f19263c3

[2022-01-10 10:26:37,903: INFO/ForkPoolWorker-6] output: Successfully tagged orchest-env-90caa39b-32b5-44a3-96ef-4f066e303ccc-381d92b7-3b3f-4384-b778-16a7fbccbeac:latest

[2022-01-10 10:26:37,918: INFO/ForkPoolWorker-6] task done, status: SUCCESS
[2022-01-10 10:26:37,920: INFO/ForkPoolWorker-6] [Killed] child_pid: 135
[2022-01-10 10:26:38,152: INFO/ForkPoolWorker-6] Task app.core.tasks.build_environment[42f9dbd6-7751-4440-8157-4f27eaada108] succeeded in 105.47356860000218s: 'SUCCESS'
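The "[Killed] child_pid: 135" line above suggests the install process was killed mid-build; on a memory-constrained host that is typically the kernel's OOM killer, even though the build task itself reports SUCCESS. A generic way to check (assuming a Linux Docker host; not Orchest-specific):

    # kernel messages left behind by the OOM killer
    dmesg -T | grep -iE 'out of memory|killed process'

    # watch container memory usage while re-running the build
    docker stats --no-stream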

Jacopo

01/10/2022, 11:06 AM
What status is the environment build panel reporting?

Max Zwiessele

01/10/2022, 11:06 AM
Build status: SUCCESS
Build requested: Jan 10, 2022 10:24 AM
Build finished: Jan 10, 2022 10:26 AM
It might just be that I need to give the worker more memory.

Jacopo

01/10/2022, 11:08 AM
How much memory does your worker have currently? I would say that would be a good first thing to try; we are brainstorming a bit to understand what could have happened.
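As a rough sketch of how to check that (assuming Docker Desktop or another setup where Docker runs inside a VM with its own memory cap), docker info reports the total memory available to containers, and raising it is done in Docker Desktop under Settings > Resources:

    # total memory Docker can hand out to containers
    docker info | grep -i 'total memory'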

Rick Lamers

01/10/2022, 11:08 AM
How did you infer that it did not in fact successfully build the environment? Where is it "breaking", so to speak?

Max Zwiessele

01/10/2022, 11:15 AM
I'm running a jupyterlab notebook and trying to run import torch, which says it's not available.

Jacopo

01/10/2022, 11:17 AM
Assuming the right environment is set for the step, could you try restarting the kernel and/or the Orchest session?

Rick Lamers

01/10/2022, 11:17 AM
A kernel restart alone should be sufficient.
Just to make sure: you have a single environment in the project in which you're testing?
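As a side check, independent of the notebook kernel, one could run Python directly inside the built image, using the image tag from the build log above (a sketch; it assumes python is on the image's PATH):

    docker run --rm \
      orchest-env-90caa39b-32b5-44a3-96ef-4f066e303ccc-381d92b7-3b3f-4384-b778-16a7fbccbeac:latest \
      python -c "import torch; print(torch.__version__)"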

Max Zwiessele

01/10/2022, 11:25 AM
Your very first suggestion worked flawlessly šŸ‘ More memory and it worked like a charm. Should have thought of that...

Jacopo

01/10/2022, 11:38 AM
Happy to hear it worked, Max. FYI, we are working on improving resource tracking and management.

juanlu

01/10/2022, 11:40 AM
Downloading torch-1.10.1-cp37-cp37m-manylinux1_x86_64.whl (881.9 MB)
😬 https://github.com/pypa/pip/issues/2984#issuecomment-789708327 (pip uses ~2.7 GB of RAM to install that particular wheel)
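One commonly suggested way to cut pip's memory footprint for very large wheels like this one (a workaround from that discussion, not something the Orchest build does automatically) is to skip pip's HTTP cache, so the wheel is not buffered in memory for caching:

    pip install --no-cache-dir torch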
šŸ¤“ 1

Yannick

01/10/2022, 11:45 AM
@Jacopo Could we catch the error code and display a help message in Orchest to help debug this issue (docs: so we catch the 137 exit code)?
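For context, exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL, which is what the OOM killer sends. A rough sketch of the idea (hypothetical; a real implementation would hook into the celery build task rather than a shell wrapper):

    # hypothetical wrapper around the install step of a build
    pip install torch
    status=$?
    if [ "$status" -eq 137 ]; then
      echo "Install step killed (exit code 137): it most likely ran out of memory."
      echo "Try giving Docker more memory, then rebuild the environment."
    fi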

Jacopo

01/10/2022, 11:53 AM
That should be possible; it depends on what kind of scheduling we are looking for (thinking of the multi-node migration).
šŸ‘€ 1