Rafael Rodrigues Santana

02/16/2023, 2:04 PM
Hello guys, some of the jupyter-servers in our custom deployment are stuck in the following state:
Awaiting boot lock...
Awaiting boot lock...
Awaiting boot lock...
Awaiting boot lock...
Awaiting boot lock...
Awaiting boot lock...
Awaiting boot lock...
Awaiting boot lock...
Awaiting boot lock...
Awaiting boot lock...
Because of this, we are unable to create sessions in those projects.

I have also noticed that restarting the Orchest cluster with the **restart** button fixes some of the projects that were not working, but other projects that were working start to have the same issue.

Some important points:
- I have not found a lock in the $lockdir
- By running the /start.sh with the same parameters, I could reproduce the same result in jupyter-servers that are not working.

Jacopo

02/16/2023, 4:00 PM
Hi, could you exec into the orchest webserver and take a look at what's in the `.orchest/user-configurations/jupyterlab/lab` path of the userdir? I'd expect the lock to be there.
What's in the `/usr/local/share/jupyter/lab` path of the jupyter servers that are and are not failing? I.e. is the lock there for any of them?
If the lock is not there, I'd exec into the failing jupyter servers and take a look at the result of
userdir_path=/usr/local/share/jupyter/lab
lockdir=$userdir_path/.bootlock
mkdir $lockdir
I guess the error message will be interesting
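For context, a minimal sketch of how a `mkdir`-based lock like the one above behaves. This is an illustration, not Orchest's actual `start.sh`: the temporary `demo_dir` stands in for `/usr/local/share/jupyter/lab`, and `try_lock` is a hypothetical helper name.

```shell
# Hypothetical sketch of a mkdir-based lock; not Orchest's start.sh.
# demo_dir stands in for /usr/local/share/jupyter/lab.
demo_dir=$(mktemp -d)
lockdir="$demo_dir/.bootlock"

try_lock() {
    # mkdir fails if the directory already exists, so only one
    # caller at a time can create it and thereby hold the lock.
    mkdir "$lockdir" 2>/dev/null
}

if try_lock; then first=acquired; else first=busy; fi
if try_lock; then second=acquired; else second=busy; fi

echo "first=$first second=$second"

# Clean up the demo directory, lock included.
rm -rf "$demo_dir"
```

The second attempt fails because the directory already exists. Running the bare `mkdir $lockdir` without the `2>/dev/null` redirect shows the reason for the failure, which helps tell an existing lock apart from, say, a permissions problem on the mounted directory.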

Rafael Rodrigues Santana

02/16/2023, 5:29 PM
Yea, it's there:
The interesting thing is that the lock is present in both scenarios:
Logs from a server that's working properly:

Jacopo

02/17/2023, 8:02 AM
Have you tried deleting the lock file? That should pretty much suffice. This snippet of logic
release_lock() {
    rm -rf $lockdir
}
is run once the startup logic of the jupyter server is done. Perhaps something went wrong and a dangling lock was left around.
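One common way to reduce the chance of a dangling lock is to register the release function as an `EXIT` trap, so it runs even when startup fails midway. The sketch below is an assumption about how that could look, not a description of what Orchest does: `demo_dir` is a throwaway directory, and the subshell with `exit 1` simulates a startup script that crashes after taking the lock.

```shell
# Hypothetical sketch: release the lock via an EXIT trap.
demo_dir=$(mktemp -d)
lockdir="$demo_dir/.bootlock"

release_lock() {
    rm -rf "$lockdir"
}

(
    # The subshell stands in for a startup script.
    mkdir "$lockdir"
    trap release_lock EXIT
    exit 1   # simulated failure before the normal release_lock call
) || true

if [ -d "$lockdir" ]; then lock_state=dangling; else lock_state=released; fi
echo "lock after failed startup: $lock_state"

rm -rf "$demo_dir"
```

A trap cannot run if the process is killed with SIGKILL, so a lock on a shared mount can still be left behind when a container is forcibly terminated. In that case deleting the lock directory by hand remains the fallback.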

Rafael Rodrigues Santana

02/17/2023, 12:41 PM
I haven't tried. Should I do it on the webserver or on each of the containers?

Jacopo

02/17/2023, 12:43 PM
Only once. It doesn't matter from which container, since those directories are mounted; the webserver is fine.

Rafael Rodrigues Santana

02/17/2023, 12:43 PM
Makes sense.
I'll try it right now, thanks for the support, as always, Jacopo.

Jacopo

02/17/2023, 12:43 PM
no problem 👍

Rafael Rodrigues Santana

02/17/2023, 12:49 PM
Unfortunately, it didn't work =x

Jacopo

02/17/2023, 12:49 PM
Have you tried restarting the jupyter server pods that are stuck? Now that the lock is gone it shouldn't be a problem anymore

Rafael Rodrigues Santana

02/17/2023, 12:50 PM
yea, I killed those pods

Jacopo

02/17/2023, 12:51 PM
Are they still stuck waiting on the lock? How many such pods are there? Could it be because there are many jupyter servers going through the logic sequentially, i.e. one lock acquisition at a time?
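To illustrate the "one lock acquisition at a time" behaviour: when several servers share one lock directory, each one polls until the current holder releases it. The sketch below adds a bounded wait. The `acquire_lock` function and its attempt limit are hypothetical, and the thread does not say whether the real startup script bounds its wait.

```shell
# Hypothetical sketch: poll for the lock, giving up after N attempts.
demo_dir=$(mktemp -d)
lockdir="$demo_dir/.bootlock"

acquire_lock() {
    max_attempts=$1   # hypothetical parameter: attempts before giving up
    attempts=0
    until mkdir "$lockdir" 2>/dev/null; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge "$max_attempts" ]; then
            return 1
        fi
        echo "Awaiting boot lock..."
        sleep 1
    done
    return 0
}

# Case 1: another server already holds the lock and never releases it.
mkdir "$lockdir"
if acquire_lock 2; then held=acquired; else held=timeout; fi

# Case 2: the lock has been released (or removed by hand).
rm -rf "$lockdir"
if acquire_lock 2; then free=acquired; else free=timeout; fi

echo "lock held: $held, lock free: $free"
rm -rf "$demo_dir"
```

With an unbounded loop, the first case would print `Awaiting boot lock...` forever, which matches the symptom in the logs at the top of the thread. A bounded wait turns that into an explicit failure that is easier to spot.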

Rafael Rodrigues Santana

02/17/2023, 12:52 PM
Currently, 7 of them are stuck in CrashLoopBackOff
I'll kill the deployments and try this process one by one to see if it works

Jacopo

02/17/2023, 12:53 PM
I see, I'd be curious to see what they log before crashing

Rafael Rodrigues Santana

02/17/2023, 2:25 PM
By deleting the deployments and starting them one by one, I was able to fix this issue. Thank you very much, @Jacopo.