a

Alexsander Pereira

12/29/2022, 5:02 PM
Guys, we are having problems with a possible memory leak on the Orchest cluster: memory usage keeps rising until it is exhausted. We are monitoring with node_exporter (Prometheus) and we can see this behavior.
I added a few hours of cAdvisor (Prometheus) monitoring to get more detail, and I saw something strange with the Orchest API... but I need metrics over a longer period to be sure.
And it keeps going up endlessly...
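One way to check whether a series like this is really climbing steadily, rather than just fluctuating, is to fit a straight line to the samples and look at the slope. The sketch below assumes the memory samples have already been exported from Prometheus as (timestamp in seconds, bytes) pairs; the function names and the 50 MiB per hour threshold are hypothetical choices for illustration, not anything taken from Orchest or Prometheus.

```python
from typing import Sequence


def growth_rate(timestamps: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares slope of values over timestamps (value units per second)."""
    n = len(timestamps)
    if n != len(values) or n < 2:
        raise ValueError("need at least two (timestamp, value) pairs")
    mean_t = sum(timestamps) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(timestamps, values))
    return cov / var_t


def looks_like_leak(timestamps, values, min_bytes_per_hour=50 * 1024**2):
    """True if memory grows at least min_bytes_per_hour on average.

    The default threshold is an arbitrary illustrative value.
    """
    return growth_rate(timestamps, values) * 3600 >= min_bytes_per_hour
```

A slope alone cannot tell a leak from a legitimately growing workload, so this is only a first filter before looking at individual containers.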
j

Jacopo

12/30/2022, 9:04 AM
Hi there, sorry to hear you are having this issue. Could you tell me the version of Orchest you are running?
r

Rafael Rodrigues Santana

12/30/2022, 12:25 PM
@Jacopo v2022.10.5-1.14.0
👀 1
👍 1
a

Alexsander Pereira

12/30/2022, 2:23 PM
v2022.10.5
image.png
? 👀
This memory leak is crashing our production environment. We have to restart Orchest to bring memory usage down before it is exhausted.
j

Jacopo

01/03/2023, 8:20 AM
Hi @Alexsander Pereira, we are still looking into this, do you have a reproducible case?
Btw, as a band-aid for the time being, since you are building your own images, you could add the max_requests settings in services/orchest-api/app/gunicorn_conf.py (and increase the number of workers to > 1), see https://docs.gunicorn.org/en/stable/settings.html
Still looking into it!
Btw, did you find any interesting info while debugging? Anything suspicious in the orchest-api container that could be seen through top or other tools?
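A minimal sketch of the band-aid suggested above, assuming gunicorn_conf.py is a plain Python settings module as gunicorn expects. The setting names (workers, max_requests, max_requests_jitter) are standard gunicorn settings; the numbers are illustrative assumptions, not values used by Orchest.

```python
# gunicorn_conf.py (sketch): all values below are illustrative assumptions.

# More than one worker, so requests are still served while one worker restarts.
workers = 2

# Restart a worker after it has handled this many requests, which bounds how
# long leaked memory can accumulate in a single process.
max_requests = 1000

# Add a random amount between 0 and this value to each worker's limit, so the
# workers do not all restart at the same moment.
max_requests_jitter = 100
```

This only recycles the leaking processes periodically; it does not remove the leak itself.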
a

Alexsander Pereira

01/04/2023, 5:28 PM
@Jacopo We are not building the orchest-api image.
image.png
a

Allan Sene

01/04/2023, 10:29 PM
Hi, guys! I suppose that this error is difficult to reproduce. Maybe if we did a hands-on session together, we could tackle this more effectively? This problem is causing us a lot of pain with some customers right now.
j

Jacopo

01/05/2023, 9:30 AM
> @Jacopo We are not building the orchest-api image.
I see, I got confused by the custom version. I guess you are building only some of the Orchest images? @Allan Sene sounds good, what time zone are you guys in? Let's find a time that works for the call.
Btw, I was finally able to reproduce, or rather witness, an instance having the same memory leak. So far it looks like a memory leak in a dependency, although I have yet to confirm it. I'm not sure a hands-on debugging call is needed at this point.
b

Beatriz Antunes

01/05/2023, 1:35 PM
Hello @Jacopo! Thanks for the help. We work in UTC-3, can you suggest a time this week or next?
j

Jacopo

01/05/2023, 4:40 PM
Hi @Beatriz Antunes, we've found the point where the memory leak happens. We are now looking into the dependency to see how to correct this, and tomorrow we'll release a fix.
💯 1
🙌 1
a

Allan Sene

01/05/2023, 6:56 PM
awesome news @Jacopo! thank you so much
a

Alexsander Pereira

01/06/2023, 1:55 PM
Great news!
j

Jacopo

01/06/2023, 2:40 PM
With release v2023.01.0 the issue should be fixed! Let us know if everything looks alright @User @User
🎉 1
b

Beatriz Antunes

01/06/2023, 3:45 PM
Thank you so much Jacopo! 👌
👍 1
j

Jacopo

01/07/2023, 12:07 PM
If you have recently updated to v2023.01.0 or v2023.01.1 and you are experiencing issues related to git imports or ssh/git config setup, please update to v2023.01.2
r

Rafael Rodrigues Santana

01/09/2023, 5:08 PM
@Jacopo we have updated our Orchest deployment to version v2023.01.2, but it seems that the problem persists.
j

Jacopo

01/09/2023, 6:04 PM
That's very peculiar. We have been keeping an eye on this and so far everything seemed to be fixed. I'll look into this first thing, sorry that you are being affected. Is there any notable difference in the way Orchest is deployed on your side w.r.t. a "normal" installation?
r

Rafael Rodrigues Santana

01/09/2023, 6:06 PM
Our deployment uses kubectl. The only pods that use custom images are auth-server and webserver.
j

Jacopo

01/09/2023, 6:11 PM
I see, would you be able to share part of the kubectl scripts you are using to deploy Orchest that way in a DM or something? Moreover, could you send a dump of the orchest.io/v1alpha1/orchestclusters CRD? Example:
kubectl describe orchestclusters -n orchest cluster-1
I'd also be interested in the output of
kubectl describe orchestcomponents -n orchest orchest-api
b

Beatriz Antunes

01/10/2023, 1:33 PM
@Alexsander Pereira can you send this information to Jacopo?
r

Rafael Rodrigues Santana

01/10/2023, 2:55 PM
I'll send this information using DM @Jacopo
👍 2
j

Jacopo

01/10/2023, 2:56 PM
Looking at the CRD, it looks like only the auth server and the webserver have been updated, while the other images are at 2022.10.5
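A mismatch like this can also be spotted from the pod list itself. The sketch below assumes the output of `kubectl get pods -n orchest -o json` has been loaded into a Python dict, and that Orchest images can be recognised by the substring "orchest/" in the image name; that filter, the function names, and the pod names in the usage note are assumptions for illustration.

```python
from typing import Dict, List, Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag of a container image reference, or None if it has none."""
    # Look only at the last path segment so a registry port is not taken for a tag,
    # and drop any "@sha256:..." digest suffix.
    last = image.rsplit("/", 1)[-1].split("@", 1)[0]
    _, sep, tag = last.rpartition(":")
    return tag if sep else None


def pods_not_at_version(
    pod_list: dict, expected: str, image_filter: str = "orchest/"
) -> Dict[str, List[str]]:
    """Map pod name -> images matching image_filter whose tag differs from expected.

    image_filter is an assumed way to skip third-party images in the namespace.
    """
    stale: Dict[str, List[str]] = {}
    for pod in pod_list.get("items", []):
        images = [c["image"] for c in pod["spec"]["containers"]]
        outdated = [
            img for img in images
            if image_filter in img and image_tag(img) != expected
        ]
        if outdated:
            stale[pod["metadata"]["name"]] = outdated
    return stale
```

For example, with a hypothetical pod `orchest-api-abc` running `orchest/orchest-api:v2022.10.5`, calling `pods_not_at_version(pods, "v2023.01.2")` would list that pod, while pods already on `v2023.01.2` would be left out.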
r

Rafael Rodrigues Santana

01/10/2023, 2:58 PM
oh, thanks, makes sense.
We have updated the version of the controller, which should update the version of the images managed by Orchest... What am I missing here?
j

Jacopo

01/10/2023, 3:05 PM
If the controller is at v2023.01.4 and nothing is happening, I'd be curious to check the logs of the controller. How did you perform updates previously? Given the custom images you are running the auth server and webserver with, I'd imagine that the controller would have overwritten those during updates
r

Rafael Rodrigues Santana

01/10/2023, 3:11 PM
We have two YAML files: orchest-controller.yml and orchest-cluster.yml. The orchest-cluster one is applied after the controller is applied.
j

Jacopo

01/10/2023, 3:18 PM
The orchest-controller.yml is pretty much identical to the latest one we have, so I'm wondering if something has gone wrong at the controller level. Is there anything interesting in the logs?
r

Rafael Rodrigues Santana

01/10/2023, 3:18 PM
I'm not sure, I'll send you the logs that I get from kubectl.
j

Jacopo

01/10/2023, 3:19 PM
Did you ever have the chance to verify that updates were actually working after customizing the auth and webserver images? I ask since the other images seem to be stuck at v2022.10.5
r

Rafael Rodrigues Santana

01/10/2023, 3:19 PM
Yea, it's working fine
Except for the memory leak, everything else is working properly
j

Jacopo

01/10/2023, 3:20 PM
What I mean is, was the update ever working in the sense that images were correctly updated to the version in use by the controller?
r

Rafael Rodrigues Santana

01/10/2023, 3:32 PM
Ah, now I understand. We are not sure; we always assumed that because the webserver and auth server had been updated properly, the rest had been updated correctly as well. In the past, we used version v2022.10.0. So the version has been updated; however, we don't know whether that was because the update process ran properly or because we had to recreate the cluster at some point in the past.
👀 1
j

Jacopo

01/10/2023, 3:45 PM
So far the one thing that looks suspicious is that the orchest-database component has a "last-applied-configuration" that contains the orchest-cluster, but I'm no k8s guru so it could be of no importance and/or correct. How are you performing updates? By applying (in this order) orchest-controller.yml, orchest-cluster.yml and that's it? I'll try the same on my end
r

Rafael Rodrigues Santana

01/10/2023, 3:46 PM
Yea. I think we have a suspect. In the past, we passed the version parameter in orchest-cluster.yml, but for some reason we removed the parameter from the YAML.
We're going to add it again and make the update process run again.
👀 1
j

Jacopo

01/10/2023, 3:55 PM
I've applied your files/orchest-cluster.yml on my cluster and so far so good
I'll try changing the controller version and applying again to simulate an update
r

Rafael Rodrigues Santana

01/10/2023, 3:57 PM
It worked by adding the version to the orchest-controller
j

Jacopo

01/10/2023, 3:58 PM
Oh, that's great to hear!
Question, did you mean the version in the orchest cluster CRD? Because I see that it was previously at v2022.10.5 and it's now at v2022.01.0, so I guess that was the missing piece of the puzzle?
Not sure if you have already done that, but take a look at the orchest-cli at orchest-cli/orchestcli/cmds.py (in the orchest/orchest repository): both the controller and the orchest cluster YAML should be updated
r

Rafael Rodrigues Santana

01/10/2023, 4:03 PM
Yea, exactly, by not providing the version in orchest-cluster.yml, the version of the controller-managed pods was not being updated.
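A small pre-apply check can catch this kind of omission. The sketch below assumes the cluster manifest has already been parsed into a Python dict (for example with a YAML parser) and that the version lives under spec.orchest.version; that field path is an assumption and is not confirmed anywhere in this thread, so adjust it to match your own orchest-cluster.yml. The function names are hypothetical.

```python
from typing import Any, Optional, Sequence

# ASSUMPTION: location of the version field in the OrchestCluster manifest.
VERSION_PATH = ("spec", "orchest", "version")


def declared_version(
    manifest: dict, path: Sequence[str] = VERSION_PATH
) -> Optional[str]:
    """Return the version string found at path, or None if it is absent."""
    node: Any = manifest
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) else None


def version_problem(
    manifest: dict, controller_version: str, path: Sequence[str] = VERSION_PATH
) -> Optional[str]:
    """Describe a version problem in the manifest, or return None if it looks fine."""
    found = declared_version(manifest, path)
    if found is None:
        return "cluster manifest declares no version"
    if found != controller_version:
        return f"cluster manifest is at {found}, controller is at {controller_version}"
    return None
```

Running such a check before applying the two files would flag a manifest whose version was dropped, instead of leaving the managed images silently on the old release.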
j

Jacopo

01/10/2023, 4:04 PM
Alright sounds good
r

Rafael Rodrigues Santana

01/10/2023, 4:07 PM
Thanks for the support @Jacopo 🙂
j

Jacopo

01/10/2023, 4:09 PM
No problem! Please let me know if everything is working!
r

Rafael Rodrigues Santana

01/10/2023, 4:10 PM
We will monitor the process now to see if the memory leak ceases hehehe
j

Jacopo

01/10/2023, 4:10 PM
👍
pretty sure it will!
r

Rafael Rodrigues Santana

01/19/2023, 1:58 PM
{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchFields":[{"key":"metadata.name","operator":"In","values":["ip-13-0-1-179.ec2.internal"]}]}]}}}
It's pointing to the old node
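To confirm which node an affinity block like the one above is pinned to, the JSON can be parsed and compared against the nodes that currently exist. This is a sketch: it handles only `matchFields` entries on `metadata.name` with the `In` operator, as in the snippet above, and the function names and the list of current nodes are hypothetical inputs you would supply yourself (for example from `kubectl get nodes`).

```python
import json
from typing import Iterable, List


def pinned_node_names(affinity_json: str) -> List[str]:
    """Collect node names required via metadata.name matchFields in a nodeAffinity."""
    affinity = json.loads(affinity_json)
    required = affinity.get("nodeAffinity", {}).get(
        "requiredDuringSchedulingIgnoredDuringExecution", {}
    )
    names: List[str] = []
    for term in required.get("nodeSelectorTerms", []):
        for field in term.get("matchFields", []):
            if field.get("key") == "metadata.name" and field.get("operator") == "In":
                names.extend(field.get("values", []))
    return names


def missing_nodes(affinity_json: str, current_nodes: Iterable[str]) -> List[str]:
    """Return the pinned node names that are not among current_nodes."""
    existing = set(current_nodes)
    return [name for name in pinned_node_names(affinity_json) if name not in existing]
```

If `missing_nodes` returns a non-empty list, the affinity still references a node that is no longer in the cluster.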