
Rafael Rodrigues Santana

01/18/2023, 9:19 PM
Hi guys, today our node was decommissioned because Kubernetes was unable to connect to it. Kubernetes has created another one, but our sessions are unable to start up. I have already deleted the reference to the old node in the Orchest database. Any idea on how to stabilize the sessions?
image.png
Log from one of the session sidecar pods:
image.png
It seems that we have started to have a node affinity problem in the environments:
Warning  FailedScheduling  3m24s (x70 over 83m)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
This is the output for this command:
kubectl describe pod environment-shell-cadca39e-f914-4198-9723-f175f87d70df-f02s5hzs -n orchest

Jacopo

01/19/2023, 8:09 AM
Hi @Rafael Rodrigues Santana, I'll be looking into this
Could you provide the node selectors of the pods that are stuck? Have you tried stopping and restarting the sessions with this issue?
The reason I'm asking is that since v2022.12.0, the logic responsible for modifying the node affinities queries the k8s API to filter out nodes that aren't there anymore or are malfunctioning, so that these are not used in affinities/node selectors. If no node with the desired properties (readiness, and other properties internal to the product) is found, the logic assumes it's in a "particular" situation and forfeits applying any node selector or affinities to the pod, so as not to disrupt user activities.
By stopping and restarting the session, all the related manifests will be created from scratch, changing the node selectors/affinities involved.
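For intuition only, here is a minimal sketch of that filtering, assuming the kubernetes Python client and a hypothetical desired_node_name; this is not the actual Orchest code:
Copy code
from kubernetes import client, config


def node_selector_for(desired_node_name: str):
    """Hypothetical sketch: keep the node selector only if the desired node
    still exists and is Ready; otherwise return None, i.e. forfeit the
    selector entirely so the pod isn't blocked."""
    config.load_kube_config()  # use config.load_incluster_config() in-cluster
    for node in client.CoreV1Api().list_node().items:
        is_ready = any(
            c.type == "Ready" and c.status == "True"
            for c in (node.status.conditions or [])
        )
        if node.metadata.name == desired_node_name and is_ready:
            return {"kubernetes.io/hostname": desired_node_name}
    # Node is gone or NotReady: apply no selector at all.
    return None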
Another thing that could help with the debugging: get the image of the environment shell pod and run it through this code to get its project uuid, environment uuid and tag:
Copy code
def env_image_name_to_proj_uuid_env_uuid_tag(
    name: str,
):
    tag = None
    if ":" in name:
        name, tag = name.split(":")
    env_uuid = name[-36:]
    # The name has the form <optional
    # ip>/orchest-env-<proj_uuid>-<env_uuid>..., so we need to skip
    # the "-" between the two uuids.
    proj_uuid = name[-73:-37]
    return proj_uuid, env_uuid, tag
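For example, running it on an image name of that shape gives the following (the name below is the example quoted further down in the thread):
Copy code
proj_uuid, env_uuid, tag = env_image_name_to_proj_uuid_env_uuid_tag(
    "10.96.0.2/orchest-env-245daab6-a472-428a-b6d4-a72bb1fac297-c56ab762-539c-4cce-9b1e-c4b00300ec6f:1"
)
# proj_uuid == "245daab6-a472-428a-b6d4-a72bb1fac297"
# env_uuid  == "c56ab762-539c-4cce-9b1e-c4b00300ec6f"
# tag       == "1"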
Then, in the orchest-api db:
Copy code
select 
  * 
from 
  environment_image_on_nodes 
where 
  project_uuid = '<project_uuid>' 
  and environment_uuid = '<environment_uuid>' 
  and environment_image_tag = <tag>;
note that the tag is an integer, not a string
The list of nodes along with their status (kubectl get nodes) will also help in debugging. I have been trying to reproduce on an EKS cluster and so far I haven't had any luck, so I'm wondering if it's just a matter of restarting the sessions or if there is a deeper issue that might be revealed with more information. Another thing: is the registry running? Can pipeline runs from jobs proceed correctly?

Rafael Rodrigues Santana

01/19/2023, 1:05 PM
@Jacopo here are the labels for one of the pods that are stuck:
Copy code
Name:             environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-c0avgtfv
Namespace:        orchest
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=environment-shell
                  pod-template-hash=656bb55868
                  project_uuid=7358544f-0687-430b-a332-d62e79e12a62
                  session_uuid=7358544f-0687-430b0ee4cfac-8c25-4bba
                  shell_uuid=aa88a373-60fa-4bbe-ae69-49e6d415987c-c0a21a
Annotations:      kubernetes.io/psp: eks.privileged
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-c0a21a-656bb55868
Init Containers:
  image-puller:
    Image:      orchest/image-puller:v2022.10.5
    Port:       <none>
    Host Port:  <none>
    Command:
      /pull_image.sh
    Environment:
      IMAGE_TO_PULL:      10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c:5
      CONTAINER_RUNTIME:  docker
    Mounts:
      /var/run/runtime.sock from container-runtime-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g75n2 (ro)
Copy code
➜  ~ kubectl get nodes
NAME                         STATUS   ROLES    AGE   VERSION
ip-13-0-1-128.ec2.internal   Ready    <none>   16h   v1.22.12-eks-ba74326
Yea, I've tried restarting the sessions, but had no luck.
šŸ‘€ 1
@Jacopo I think I've figured out the issue. Basically, when we have services on the project, if the environment image for the service is not built beforehand, it seems that the session is not able to start. When our node was cycled, it seems we have to rebuild all environment images.

Jacopo

01/19/2023, 1:44 PM
@Rafael Rodrigues Santana I don't see the node selector labels and affinities in the yml you posted
Basically, when we have services on the project, if the environment image for the service is not built beforehand, it seems that the session is not able to start. When our node was cycled, it seems we have to rebuild all environment images
That's extremely strange and warrants more investigation. Could you provide the output of https://orchest.slack.com/archives/C045TCTCMAP/p1674119796854289?thread_ts=1674076799.179869&cid=C045TCTCMAP?
When our node was cycled, it seems we have to rebuild all environment images
That shouldn't be the case. Is the registry working? I think we are missing a piece of the puzzle.

Rafael Rodrigues Santana

01/19/2023, 1:46 PM
When you say get the image, you mean a kubectl exec in the pod to run the code you sent me?

Jacopo

01/19/2023, 1:47 PM
No, I meant get the image that the pod is using

Rafael Rodrigues Santana

01/19/2023, 1:47 PM
I see.
IMAGE_TO_PULL: 10.100.0.2/orchest-env-33d5abf6-9933-4c82-bfc7-616e6309efee-cadca39e-f914-4198-9723-f175f87d70df:3

Jacopo

01/19/2023, 1:48 PM
Since the image name is formatted like this:
10.96.0.2/orchest-env-245daab6-a472-428a-b6d4-a72bb1fac297-c56ab762-539c-4cce-9b1e-c4b00300ec6f:1
splitting the project uuid, environment uuid and tag is a bit annoying, so you can use that python snippet to do that in a terminal
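For instance, applied to the IMAGE_TO_PULL value above, the snippet would return:
Copy code
proj_uuid, env_uuid, tag = env_image_name_to_proj_uuid_env_uuid_tag(
    "10.100.0.2/orchest-env-33d5abf6-9933-4c82-bfc7-616e6309efee-cadca39e-f914-4198-9723-f175f87d70df:3"
)
# proj_uuid == "33d5abf6-9933-4c82-bfc7-616e6309efee"
# env_uuid  == "cadca39e-f914-4198-9723-f175f87d70df"
# tag       == "3"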

Rafael Rodrigues Santana

01/19/2023, 1:48 PM
Ok, let me do it.

Jacopo

01/19/2023, 1:49 PM
Once you have those, query the orchest-api to see on which nodes it believes the image is:
Copy code
select 
  * 
from 
  environment_image_on_nodes 
where 
  project_uuid = '<project_uuid>' 
  and environment_uuid = '<environment_uuid>' 
  and environment_image_tag = <tag>;
^ mind the tag being an integer, not a string
Just to make sure, on which version of Orchest is this instance?

Rafael Rodrigues Santana

01/19/2023, 1:52 PM
image.png

Jacopo

01/19/2023, 1:54 PM
Assuming this isn't an image that was just built but was built on the old node, this confirms that things are working in this regard. When you add a new node, this happens:
• the node-agent on the node checks with the orchest-api which images it should pull
• it pulls them on the node, and only then notifies the orchest-api about the image being on that node, leading to the creation of such a record
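A very rough sketch of that ordering, using hypothetical placeholder functions (this is not the real node-agent or orchest-api code, just an illustration that the record only appears after a successful pull):
Copy code
from typing import List


def images_to_pull(node_name: str) -> List[str]:
    # Hypothetical placeholder: the real node-agent asks the orchest-api.
    return []


def pull_image(image: str) -> None:
    # Hypothetical placeholder: the real node-agent pulls via the container runtime.
    pass


def notify_image_on_node(image: str, node_name: str) -> None:
    # Hypothetical placeholder: the real node-agent notifies the orchest-api,
    # which then creates the environment_image_on_nodes record queried above.
    pass


def sync_node(node_name: str) -> None:
    for image in images_to_pull(node_name):
        pull_image(image)
        # Only after the pull succeeds is the orchest-api notified.
        notify_image_on_node(image, node_name)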

Rafael Rodrigues Santana

01/19/2023, 1:55 PM
Hmm, let me check if it wasn't built just now.

Jacopo

01/19/2023, 1:56 PM
I'd also be interested in the output of
kubectl get -n orchest pod <your pod> -o jsonpath='{.spec.affinity}'
That seems to be pointing to the old node; any chance that this environment shell belongs to a session that has not been restarted?

Rafael Rodrigues Santana

01/19/2023, 2:00 PM
Let me try to restart just the session, without rebuilding the environment.
šŸ‘ 1
Not sure if I mixed something up or if something changed in the cluster today, but now restarting the session properly solves this issue with environment pods hanging. When the session is restarted, the pending pod is killed and a new one is started to replace it.
I'll do this to make sure this works for every project we have.
šŸ‘€ 1
šŸ‘ 1
@Jacopo I found a project where this issue is still happening. The session keeps trying to be created, but is looping forever. The environment shell pod is in this state:
environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-b3b2q82f   0/1     Init:CrashLoopBackOff   4 (48s ago)      2m16s
j

Jacopo

01/19/2023, 4:38 PM
Likely unrelated to the affinity issue. Any logs or statuses about the failure?
Environment shell failures do not hinder a session start as far as I remember 🤔

Rafael Rodrigues Santana

01/19/2023, 4:39 PM
Here are the session sidecar logs:
Copy code
INFO:root:data-app phase is Pending.
INFO:root:data-app is pending.
INFO:root:data-app phase is Pending.
INFO:root:data-app is pending.

Jacopo

01/19/2023, 4:40 PM
These logs just report the status of the pod, but what do you get when describing the failing pod?

Rafael Rodrigues Santana

01/19/2023, 4:41 PM
Copy code
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  4m59s                   default-scheduler  Successfully assigned orchest/environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85ajssrm to ip-13-0-1-128.ec2.internal
  Normal   Pulled     3m24s (x5 over 4m54s)   kubelet            Container image "orchest/image-puller:v2022.10.5" already present on machine
  Normal   Created    3m24s (x5 over 4m54s)   kubelet            Created container image-puller
  Normal   Started    3m24s (x5 over 4m54s)   kubelet            Started container image-puller
  Warning  BackOff    2m57s (x10 over 4m51s)  kubelet            Back-off restarting failed container
This version is interesting:
v2022.10.5

Jacopo

01/19/2023, 4:42 PM
Was this the instance whose update through raw yaml changes wasn't entirely correct?
Although I don't think the failure is due to the old puller, any logs from the shell container?

Rafael Rodrigues Santana

01/19/2023, 4:44 PM
Copy code
Defaulted container "environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85a9e0" out of: environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85a9e0, image-puller (init)
Error from server (BadRequest): container "environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85a9e0" in pod "environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85ajssrm" is waiting to start: PodInitializing

Jacopo

01/19/2023, 4:49 PM
Might have to wait for the pod to initialize to see the logs of the error

Rafael Rodrigues Santana

01/19/2023, 8:34 PM
There was a problem in the init container of the environment:
Copy code
➜  ~ kubectl logs environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85ajssrm -c image-puller -n orchest
Docker pull failed, pulling with buildah.
Error response from daemon: manifest for 10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c:5 not found: manifest unknown: manifest unknown
Trying to pull 10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c:5...
initializing source docker://10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c:5: reading manifest 5 in 10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c: manifest unknown: manifest unknown
Rebuilding the environment solved the problem in this case.

Jacopo

01/20/2023, 8:19 AM
Any chance that the old node failed while the build was going on? As a general remark, when removing nodes from the cluster my advice would be to stop Orchest, remove the node, and then start Orchest again; that's pretty much what we do when changing the node pool for our multi-node cloud offering. This ensures that state remains consistent, since as of now some of the logic, like some cases of pod scheduling, assumes that nodes aren't removed while Orchest is running.