Rafael Rodrigues Santana
01/18/2023, 9:19 PM
kubectl describe pod environment-shell-cadca39e-f914-4198-9723-f175f87d70df-f02s5hzs -n orchest
Jacopo
01/19/2023, 8:09 AM
v2022.12.0
The logic responsible for modifying the node affinities queries the k8s API to filter out nodes that are no longer there or are malfunctioning, so that these are not used in affinities/node selectors. If no node with the desired properties (readiness, and other properties internal to the product) is found, the logic assumes it's in a "particular" situation and forfeits applying any node selector or affinities to the pod, so as not to disrupt user activities.
def env_image_name_to_proj_uuid_env_uuid_tag(
    name: str,
):
    tag = None
    if ":" in name:
        name, tag = name.split(":")
    env_uuid = name[-36:]
    # Because the name has the form <optional
    # ip>/orchest-env-<proj_uuid>-<env_uuid>..., i.e. we need to skip
    # the "-".
    proj_uuid = name[-73:-37]
    return proj_uuid, env_uuid, tag
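(A minimal sketch of the node-filtering fallback described above, using the kubernetes Python client; the readiness check is simplified and the product-internal properties are omitted, so this illustrates the idea rather than the actual Orchest code:)

from typing import List, Optional
from kubernetes import client, config

def ready_node_names() -> List[str]:
    # In-cluster service account config; use config.load_kube_config() locally.
    config.load_incluster_config()
    nodes = client.CoreV1Api().list_node().items
    return [
        n.metadata.name
        for n in nodes
        if any(
            c.type == "Ready" and c.status == "True"
            for c in (n.status.conditions or [])
        )
    ]

def nodes_for_affinity() -> Optional[List[str]]:
    names = ready_node_names()
    if not names:
        # The "particular" situation: rather than emit an affinity that can
        # never be satisfied, forfeit node selectors/affinities entirely.
        return None
    return names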
then, in the orchest-api db:
select *
from environment_image_on_nodes
where project_uuid = '<project_uuid>'
  and environment_uuid = '<environment_uuid>'
  and environment_image_tag = <tag>;
note that the tag is an integer, not a string
The output of kubectl get nodes will also help in debugging. I have been trying to reproduce on an EKS cluster and so far I haven't had any luck, so I'm wondering if it's just a matter of restarting the sessions or if there is a deeper issue that might be revealed with more information
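(A minimal sketch of running that query from Python with psycopg2, assuming the orchest-api Postgres database has been made reachable locally, e.g. via kubectl port-forward; the host, database, and user names are assumptions and will vary by deployment:)

import psycopg2  # pip install psycopg2-binary

# Connection details are assumptions; adjust to your deployment.
conn = psycopg2.connect(host="localhost", dbname="orchest_api", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        select * from environment_image_on_nodes
        where project_uuid = %s
          and environment_uuid = %s
          and environment_image_tag = %s;
        """,
        (
            "7358544f-0687-430b-a332-d62e79e12a62",  # project uuid from this thread
            "aa88a373-60fa-4bbe-ae69-49e6d415987c",  # environment uuid
            5,  # note: the tag is an integer, not a string
        ),
    )
    for row in cur.fetchall():
        print(row)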
Another thing, is the registry running? Can pipeline runs from jobs proceed correctly?
Rafael Rodrigues Santana
01/19/2023, 1:05 PM
Name: environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-c0avgtfv
Namespace: orchest
Priority: 0
Service Account: default
Node: <none>
Labels: app=environment-shell
pod-template-hash=656bb55868
project_uuid=7358544f-0687-430b-a332-d62e79e12a62
session_uuid=7358544f-0687-430b0ee4cfac-8c25-4bba
shell_uuid=aa88a373-60fa-4bbe-ae69-49e6d415987c-c0a21a
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-c0a21a-656bb55868
Init Containers:
image-puller:
Image: orchest/image-puller:v2022.10.5
Port: <none>
Host Port: <none>
Command:
/pull_image.sh
Environment:
IMAGE_TO_PULL: 10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c:5
CONTAINER_RUNTIME: docker
Mounts:
/var/run/runtime.sock from container-runtime-socket (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g75n2 (ro)
➜ ~ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-13-0-1-128.ec2.internal Ready <none> 16h v1.22.12-eks-ba74326
Basically, when we have services on the project, if the environment image for the service is not built priorly, it seems that the session is not able to start. When our node was cycled, it seems we have to rebuild all environment images.
Jacopo
01/19/2023, 1:44 PM
> Basically, when we have services on the project, if the environment image for the service is not built priorly, it seems that the session is not able to start.
That's extremely strange and warrants more investigation, could you provide the output of https://orchest.slack.com/archives/C045TCTCMAP/p1674119796854289?thread_ts=1674076799.179869&cid=C045TCTCMAP ?
> When our node was cycled, it seems we have to rebuild all environment images
That shouldn't be the case, is the registry working? I think we are missing a piece of the puzzle
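(To answer "is the registry working?" concretely, a minimal sketch using the standard Docker registry v2 API, which answers GET /v2/ with HTTP 200 when the registry is up; the address below is the one from this thread, and the plain-http assumption may not match your registry's TLS setup:)

import requests  # pip install requests

# Registry address taken from IMAGE_TO_PULL earlier in this thread;
# adjust host/scheme to your cluster, and run from somewhere that can reach it.
REGISTRY = "http://10.100.0.2"

# A v2 registry answers GET /v2/ with HTTP 200 when it is up.
resp = requests.get(f"{REGISTRY}/v2/", timeout=5)
print("registry up:", resp.status_code == 200)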
Rafael Rodrigues Santana
01/19/2023, 1:46 PM
Jacopo
01/19/2023, 1:47 PM
Rafael Rodrigues Santana
01/19/2023, 1:47 PM
Jacopo
01/19/2023, 1:48 PM
10.96.0.2/orchest-env-245daab6-a472-428a-b6d4-a72bb1fac297-c56ab762-539c-4cce-9b1e-c4b00300ec6f:1
splitting the project uuid, environment uuid and tag is a bit annoying, so you can use that python snippet to do that in a terminal
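(For instance, feeding the image name above to the snippet from earlier in the thread gives the following; note that split returns the tag as a string here, while the database column is an integer:)

>>> env_image_name_to_proj_uuid_env_uuid_tag(
...     "10.96.0.2/orchest-env-245daab6-a472-428a-b6d4-a72bb1fac297-c56ab762-539c-4cce-9b1e-c4b00300ec6f:1"
... )
('245daab6-a472-428a-b6d4-a72bb1fac297', 'c56ab762-539c-4cce-9b1e-c4b00300ec6f', '1')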
Rafael Rodrigues Santana
01/19/2023, 1:48 PM
Jacopo
01/19/2023, 1:49 PM
you can query the orchest-api db to see in which nodes it believes the image is,
select *
from environment_image_on_nodes
where project_uuid = '<project_uuid>'
  and environment_uuid = '<environment_uuid>'
  and environment_image_tag = <tag>;
^ mind the tag being an integer, not a string
Rafael Rodrigues Santana
01/19/2023, 1:52 PM
Jacopo
01/19/2023, 1:54 PM
• the image puller asks the orchest-api which images it should pull
• it pulls them on the node, and only then notifies the orchest-api about the image being on that node, leading to the creation of such a record (see the sketch below)
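(So the environment_image_on_nodes record only appears after a successful pull. A minimal sketch of that ordering, with hypothetical function names rather than the actual image-puller code:)

from typing import List

# Hypothetical stand-ins for the real orchest-api calls and runtime pull.
def images_to_pull(node: str) -> List[str]:
    return ["10.100.0.2/orchest-env-<proj_uuid>-<env_uuid>:5"]

def pull_image(image: str) -> None:
    print(f"pulling {image}")  # the real puller shells out to docker/buildah

def register_image_on_node(node: str, image: str) -> None:
    print(f"registering {image} on {node}")  # orchest-api inserts the db record

def sync_node(node: str) -> None:
    for image in images_to_pull(node):
        pull_image(image)                    # pull first...
        register_image_on_node(node, image)  # ...record only after success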
Rafael Rodrigues Santana
01/19/2023, 1:55 PM
Jacopo
01/19/2023, 1:56 PM
kubectl get -n orchest pod <your pod> -o jsonpath='{.spec.affinity}'
Rafael Rodrigues Santana
01/19/2023, 2:00 PM
environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-b3b2q82f 0/1 Init:CrashLoopBackOff 4 (48s ago) 2m16s
Jacopo
01/19/2023, 4:38 PM
Rafael Rodrigues Santana
01/19/2023, 4:39 PM
INFO:root:data-app phase is Pending.
INFO:root:data-app is pending.
INFO:root:data-app phase is Pending.
INFO:root:data-app is pending.
Jacopo
01/19/2023, 4:40 PM
Rafael Rodrigues Santana
01/19/2023, 4:41 PM
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m59s default-scheduler Successfully assigned orchest/environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85ajssrm to ip-13-0-1-128.ec2.internal
Normal Pulled 3m24s (x5 over 4m54s) kubelet Container image "orchest/image-puller:v2022.10.5" already present on machine
Normal Created 3m24s (x5 over 4m54s) kubelet Created container image-puller
Normal Started 3m24s (x5 over 4m54s) kubelet Started container image-puller
Warning BackOff 2m57s (x10 over 4m51s) kubelet Back-off restarting failed container
v2022.10.5
Jacopo
01/19/2023, 4:42 PM
Rafael Rodrigues Santana
01/19/2023, 4:44 PM
Defaulted container "environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85a9e0" out of: environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85a9e0, image-puller (init)
Error from server (BadRequest): container "environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85a9e0" in pod "environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85ajssrm" is waiting to start: PodInitializing
Jacopo
01/19/2023, 4:49 PM
Rafael Rodrigues Santana
01/19/2023, 8:34 PM
➜ ~ kubectl logs environment-shell-aa88a373-60fa-4bbe-ae69-49e6d415987c-85ajssrm -c image-puller -n orchest
Docker pull failed, pulling with buildah.
Error response from daemon: manifest for 10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c:5 not found: manifest unknown: manifest unknown
Trying to pull 10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c:5...
initializing source docker://10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c:5: reading manifest 5 in 10.100.0.2/orchest-env-7358544f-0687-430b-a332-d62e79e12a62-aa88a373-60fa-4bbe-ae69-49e6d415987c: manifest unknown: manifest unknown
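("manifest unknown" means the registry holds no manifest for tag 5 of that repository. A minimal sketch for listing the tags the registry actually has, via the standard registry v2 tags/list endpoint; the registry address and repository name are taken from this thread, and plain http is an assumption:)

import requests  # pip install requests

REGISTRY = "http://10.100.0.2"  # from IMAGE_TO_PULL in this thread
REPO = (
    "orchest-env-7358544f-0687-430b-a332-d62e79e12a62"
    "-aa88a373-60fa-4bbe-ae69-49e6d415987c"
)

# GET /v2/<name>/tags/list returns {"name": ..., "tags": [...]}
resp = requests.get(f"{REGISTRY}/v2/{REPO}/tags/list", timeout=5)
resp.raise_for_status()
print(resp.json().get("tags"))  # if "5" is missing here, the pull must fail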
Jacopo
01/20/2023, 8:19 AM