
Serhii

11/03/2021, 8:13 AM
Weird flow when running out of memory: 1. The instance becomes inaccessible ("Instance undergoing restart or update"). 2. The instance recovers after half an hour or more, and the pipeline step is still shown as running, although it apparently should have crashed.

Rick Lamers

11/03/2021, 8:21 AM
So this problem is going to be addressed fundamentally, as we’re migrating everything to a multinode k8s backend. That’s going to be able to separate the Orchest node from the compute nodes. In the meantime we’ll look at how to make it a better experience sooner, as we believe the k8s version is 3 to 4 months out. Thanks for reporting.

Serhii

11/03/2021, 8:23 AM
Nice to hear. I guess the pricing scheme will change as well, since for now it looks like it's closely following AWS on-demand instance pricing :)

Jacopo

11/03/2021, 8:27 AM
@Serhii are you still experiencing issues with your instance as of now? I am taking a look

Serhii

11/03/2021, 8:28 AM
I did a second run and didn't guess the limit correctly 🙂
@Jacopo if it's all manually solved, I'll go in slower increments to guess it
Sorry for taking your time

Jacopo

11/03/2021, 8:30 AM
We really appreciate feedback, so absolutely no worries, and sorry for the unexpected behaviour

Serhii

11/03/2021, 9:23 AM
My instance is still broken 😞 Please ping me if there is a solution

Jacopo

11/03/2021, 9:25 AM
It seems that it has indeed become unresponsive. I am looking into it. For the sake of debugging, how CPU-heavy were the ongoing jobs? How many concurrent steps, runs, etc.?
My current guess is that the instance ran out of memory, or that the high CPU usage led to failures in other internal services. That's what it looks like from our internal dashboard.
@Serhii I have rebooted your instance. My suggestion would be to cap the amount of parallelism and memory usage to avoid choking the instance (the free tier has 2 vCPUs and 8 GB of RAM). As @Rick Lamers said, a solution is on the horizon, but we will try to deliver a fix in the shorter term.
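
One way to cap parallelism inside a step's own Python code is a bounded worker pool. This is only a minimal sketch, assuming the work splits into independent chunks; `process_chunk`, `run_capped` and `MAX_WORKERS` are hypothetical names for illustration, not part of Orchest.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: cap workers at the 2 vCPUs mentioned for the free tier.
MAX_WORKERS = 2


def process_chunk(chunk):
    """Hypothetical unit of work; replace with the real per-chunk logic."""
    return sum(chunk)


def run_capped(chunks, max_workers=MAX_WORKERS):
    """Process chunks with at most `max_workers` tasks running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order; extra chunks wait in a queue
        # until one of the capped workers is free.
        return list(pool.map(process_chunk, chunks))
```

Switching to a process pool would give true CPU parallelism, but each worker process then holds its own copy of its data, so memory use grows with the worker count.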

Serhii

11/03/2021, 9:49 AM
@Jacopo like I said, it was definitely out of memory, and I know the reason. I was trying to reduce the sample size but guessed wrong again. Actually, not completely wrong: the job succeeded using 7 out of 8 GB of memory, but it seems it still broke something. Thank you for the support.
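
Rather than guessing the sample size, a rough upper bound can be computed from a memory budget that leaves headroom for the other services on the instance. This is a back-of-the-envelope sketch: `max_rows_for_budget`, the 50% headroom and the 2x overhead factor are illustrative assumptions, not measured values or Orchest settings.

```python
def max_rows_for_budget(bytes_per_row, total_ram_bytes,
                        headroom_fraction=0.5, overhead_factor=2.0):
    """Estimate how many rows fit in RAM while leaving headroom.

    headroom_fraction: share of RAM left free for everything else (assumed).
    overhead_factor: multiplier for temporary copies made while processing (assumed).
    """
    if bytes_per_row <= 0 or overhead_factor <= 0:
        raise ValueError("bytes_per_row and overhead_factor must be positive")
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    budget = total_ram_bytes * (1 - headroom_fraction)
    return int(budget // (bytes_per_row * overhead_factor))


# Example: 8 GiB instance, rows of about 1 KiB each.
rows = max_rows_for_budget(bytes_per_row=1024, total_ram_bytes=8 * 1024**3)
```

The per-row size is best measured on a small sample of the real data, since in-memory size often differs a lot from the size on disk.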

Jacopo

11/03/2021, 9:50 AM
I see. By the way, were you able to stop and start the instance through the cloud dashboard?
My suggestion, as a temporary solution, would be to simply stop and start it again if you happen to run into memory issues again.

Serhii

11/03/2021, 10:14 AM
Okay, I didn't want to do that, so as not to leave the system in an "undefined" state. But if you say it's safe, OK.