https://www.orchest.io/
r

Rafael Rodrigues Santana

11/30/2022, 10:44 PM
Guys, today, out of the blue, our deployment started getting the following error:
Copy code
Unable to attach or mount volumes: unmounted volumes=[userdir-pvc], unattached volumes=[container-runtime-socket kube-api-access-jh5mn userdir-pvc]: timed out waiting for the condition
We SSHed into the worker machine and the userdir volume is attached. Most pods are failing with the above message. Any thoughts on what may be causing this issue?
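For reference, the usual first step with this kind of mount timeout is to look at the failing pod's events and the namespace events; the pod name below is a placeholder:
kubectl -n orchest describe pod <failing-pod-name>
kubectl -n orchest get events --sort-by=.lastTimestamp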
j

Jacopo

12/01/2022, 8:51 AM
I've never come across this particular issue so I'm not sure I can really help here 🤔, although the content of the logs makes me suspect it might not be Orchest related. What kind of setup do you have? Are you using the EBS CSI driver to provide volumes? Is this on EKS?
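A quick, non-invasive way to confirm which CSI driver is serving the volumes, as a sketch (adjust the namespace if your driver lives elsewhere):
kubectl get csidrivers
kubectl get pods -n kube-system | grep ebs-csi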
r

Rafael Rodrigues Santana

12/01/2022, 11:54 AM
Yes, we're using the EBS CSI driver to provide the volumes and Orchest is deployed on EKS. We're experiencing some intermittency: after killing some of the deployments, we were able to start up the database, webserver, etc., but the containers that start up afterward (environments, Jupyter servers, etc.) are now raising the same error.
j

Jacopo

12/01/2022, 12:01 PM
I see. How many nodes are there in the cluster?
r

Rafael Rodrigues Santana

12/01/2022, 12:01 PM
Only one worker node
j

Jacopo

12/01/2022, 12:03 PM
Any interesting logs from the ebs-csi-controller or ebs-csi-node pods?
r

Rafael Rodrigues Santana

12/01/2022, 12:12 PM
Actually no..
Copy code
➜  ~ kubectl logs ebs-csi-controller-d54f948bf-frmlb -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, csi-provisioner, csi-attacher, csi-snapshotter, csi-resizer, liveness-probe

➜  ~ kubectl logs ebs-csi-controller-d54f948bf-xc5n2 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, csi-provisioner, csi-attacher, csi-snapshotter, csi-resizer, liveness-probe

➜  ~ kubectl logs ebs-csi-node-llc5c -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I1201 12:09:40.089632       1 node.go:98] regionFromSession Node service 
I1201 12:09:40.089731       1 metadata.go:85] retrieving instance data from ec2 metadata
I1201 12:09:40.091852       1 metadata.go:92] ec2 metadata is available
I1201 12:09:40.092874       1 metadata_ec2.go:25] regionFromSession 
I1201 12:09:40.095856       1 mount_linux.go:207] Detected OS without systemd
image.png
image.png
image.png
j

Jacopo

12/01/2022, 12:21 PM
Note that both the EBS CSI controller and node pods have multiple containers, so the logs you are getting with kubectl logs default to a container chosen by kubectl.
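To see which containers a given pod has and pull logs from a specific one (pod name taken from the output above), something like:
kubectl -n kube-system get pod ebs-csi-node-llc5c -o jsonpath='{.spec.containers[*].name}'
kubectl -n kube-system logs ebs-csi-node-llc5c -c node-driver-registrar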
r

Rafael Rodrigues Santana

12/01/2022, 12:25 PM
Didn't know about that.
Copy code
➜  ~ kubectl logs ebs-csi-node-llc5c -n kube-system --all-containers=true
I1201 12:09:40.020121       1 main.go:166] Version: v2.5.1-1-g9ad99c33
I1201 12:09:40.020195       1 main.go:167] Running node-driver-registrar in mode=registration
I1201 12:09:40.089891       1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1201 12:09:41.091326       1 main.go:198] Calling CSI driver to discover driver name
I1201 12:09:41.092671       1 main.go:208] CSI driver name: "ebs.csi.aws.com"
I1201 12:09:41.092734       1 node_register.go:53] Starting Registration Server at: /registration/ebs.csi.aws.com-reg.sock
I1201 12:09:41.093061       1 node_register.go:62] Registration Server started at: /registration/ebs.csi.aws.com-reg.sock
I1201 12:09:41.093294       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I1201 12:09:41.805505       1 main.go:102] Received GetInfo call: &InfoRequest{}
E1201 12:09:41.805788       1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/ebs.csi.aws.com/registration"
I1201 12:09:41.805829       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/ebs.csi.aws.com/registration"
I1201 12:09:41.844721       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
I1201 12:09:40.190990       1 main.go:149] calling CSI driver to discover driver name
I1201 12:09:40.193136       1 main.go:155] CSI driver name: "ebs.csi.aws.com"
I1201 12:09:40.193150       1 main.go:183] ServeMux listening at "0.0.0.0:9808"
I1201 12:09:40.089632       1 node.go:98] regionFromSession Node service 
I1201 12:09:40.089731       1 metadata.go:85] retrieving instance data from ec2 metadata
I1201 12:09:40.091852       1 metadata.go:92] ec2 metadata is available
I1201 12:09:40.092874       1 metadata_ec2.go:25] regionFromSession 
I1201 12:09:40.095856       1 mount_linux.go:207] Detected OS without systemd
One of the CSI controllers had an interesting log.
👀 1
a

Alexsander Pereira

12/01/2022, 12:35 PM
Could Orchest be hitting some EBS limit when mounting the volume?
j

Jacopo

12/01/2022, 12:37 PM
Mmh 🤔 not sure, not many mounts going on after all
Anything strange among the volume attachments?
kubectl get volumeattachments
a

Alexsander Pereira

12/01/2022, 12:39 PM
I ask because I know there is a limit of around 25 attached EBS volumes per instance, but would mounts count toward this limit?
j

Jacopo

12/01/2022, 12:42 PM
Mmh, not really an expert in that particular regard, but I'd be surprised if this setup were reaching that kind of limit
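If you do want to rule out the per-instance attachment limit, the CSI driver reports how many volumes the node can still take in its CSINode object; a sketch (node name is a placeholder):
kubectl get csinodes
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[*].allocatable.count}'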
r

Rafael Rodrigues Santana

12/01/2022, 12:46 PM
I killed both CSI controller pods. One of them is working, while the other kept raising the same error messages.
a

Alexsander Pereira

12/01/2022, 2:28 PM
Guys, in addition to these volume mount errors, we are also having these errors on environment shell pods: 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
We need help debugging these two issues on our production cluster:
• Unable to attach or mount volumes: unmounted volumes=[userdir-pvc], unattached volumes=[userdir-pvc kube-api-access-tgqbj]: timed out waiting for condition
• 0/1 nodes are available: 1 node(s) do not match the pod's affinity/node selector.
j

Jacopo

12/01/2022, 2:30 PM
Could you send here the affinity and node selector of the environment shell pods?
a

Alexsander Pereira

12/01/2022, 2:33 PM
How do I get this information?
j

Jacopo

12/01/2022, 2:33 PM
the deployment YAML of the environment shell
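One way to dump them all, as a sketch, using the app=environment-shell label that shows up in the manifests in this thread:
kubectl -n orchest get deployments -l app=environment-shell -o yaml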
a

Alexsander Pereira

12/01/2022, 2:36 PM
Ok
Copy code
kind: Pod
apiVersion: v1
metadata:
  name: environment-shell-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6-0cdvr98s
  generateName: environment-shell-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6-0cde0a-69b66685bd-
  namespace: orchest
  uid: 93b8bc9a-c5b8-4e51-9d42-4f59828cf506
  resourceVersion: '7796216'
  creationTimestamp: '2022-12-01T14:25:48Z'
  labels:
    app: environment-shell
    pod-template-hash: 69b66685bd
    project_uuid: 8495ece7-ec3a-4581-aa73-999e38b27c63
    session_uuid: 8495ece7-ec3a-4581fb5467fc-7db6-41c2
    shell_uuid: 268bec2c-0a2f-4b5a-ab69-4df9d5e514c6-0cde0a
  annotations:
    kubernetes.io/psp: eks.privileged
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: environment-shell-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6-0cde0a-69b66685bd
      uid: 167abb73-ce7e-4cc7-9a1f-757d518c0060
      controller: true
      blockOwnerDeletion: true
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2022-12-01T14:25:48Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:generateName: {}
          f:labels:
            .: {}
            f:app: {}
            f:pod-template-hash: {}
            f:project_uuid: {}
            f:session_uuid: {}
            f:shell_uuid: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"167abb73-ce7e-4cc7-9a1f-757d518c0060"}: {}
        f:spec:
          f:affinity:
            .: {}
            f:nodeAffinity:
              .: {}
              f:requiredDuringSchedulingIgnoredDuringExecution: {}
          f:containers:
            k:{"name":"environment-shell-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6-0cde0a"}:
              .: {}
              f:args: {}
              f:command: {}
              f:env:
                .: {}
                k:{"name":"ORCHEST_PIPELINE_PATH"}:
                  .: {}
                  f:name: {}
                  f:value: {}
                k:{"name":"ORCHEST_PIPELINE_UUID"}:
                  .: {}
                  f:name: {}
                  f:value: {}
                k:{"name":"ORCHEST_PROJECT_UUID"}:
                  .: {}
                  f:name: {}
                  f:value: {}
                k:{"name":"ORCHEST_SESSION_TYPE"}:
                  .: {}
                  f:name: {}
                  f:value: {}
                k:{"name":"ORCHEST_SESSION_UUID"}:
                  .: {}
                  f:name: {}
                  f:value: {}
              f:image: {}
              f:imagePullPolicy: {}
              f:name: {}
              f:ports:
                .: {}
                k:{"containerPort":22,"protocol":"TCP"}:
                  .: {}
                  f:containerPort: {}
                  f:protocol: {}
              f:resources:
                .: {}
                f:requests:
                  .: {}
                  f:cpu: {}
              f:startupProbe:
                .: {}
                f:exec:
                  .: {}
                  f:command: {}
                f:failureThreshold: {}
                f:periodSeconds: {}
                f:successThreshold: {}
                f:timeoutSeconds: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
              f:volumeMounts:
                .: {}
                k:{"mountPath":"/data"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
                  f:subPath: {}
                k:{"mountPath":"/pipeline.json"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
                  f:subPath: {}
                k:{"mountPath":"/project-dir"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
                  f:subPath: {}
          f:dnsConfig:
            .: {}
            f:options: {}
          f:dnsPolicy: {}
          f:enableServiceLinks: {}
          f:initContainers:
            .: {}
            k:{"name":"image-puller"}:
              .: {}
              f:command: {}
              f:env:
                .: {}
                k:{"name":"CONTAINER_RUNTIME"}:
                  .: {}
                  f:name: {}
                  f:value: {}
                k:{"name":"IMAGE_TO_PULL"}:
                  .: {}
                  f:name: {}
                  f:value: {}
              f:image: {}
              f:imagePullPolicy: {}
              f:name: {}
              f:resources: {}
              f:securityContext:
                .: {}
                f:privileged: {}
                f:runAsUser: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
              f:volumeMounts:
                .: {}
                k:{"mountPath":"/var/run/runtime.sock"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
          f:restartPolicy: {}
          f:schedulerName: {}
          f:securityContext:
            .: {}
            f:fsGroup: {}
            f:runAsGroup: {}
            f:runAsUser: {}
          f:terminationGracePeriodSeconds: {}
          f:volumes:
            .: {}
            k:{"name":"container-runtime-socket"}:
              .: {}
              f:hostPath:
                .: {}
                f:path: {}
                f:type: {}
              f:name: {}
            k:{"name":"userdir-pvc"}:
              .: {}
              f:name: {}
              f:persistentVolumeClaim:
                .: {}
                f:claimName: {}
    - manager: kube-scheduler
      operation: Update
      apiVersion: v1
      time: '2022-12-01T14:25:48Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:conditions:
            .: {}
            k:{"type":"PodScheduled"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
      subresource: status
spec:
  volumes:
    - name: userdir-pvc
      persistentVolumeClaim:
        claimName: userdir-pvc
    - name: container-runtime-socket
      hostPath:
        path: /var/run/docker.sock
        type: Socket
    - name: kube-api-access-cghb7
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  initContainers:
    - name: image-puller
      image: orchest/image-puller:v2022.10.5
      command:
        - /pull_image.sh
      env:
        - name: IMAGE_TO_PULL
          value: >-
            10.100.0.2/orchest-env-8495ece7-ec3a-4581-aa73-999e38b27c63-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6:8
        - name: CONTAINER_RUNTIME
          value: docker
      resources: {}
      volumeMounts:
        - name: container-runtime-socket
          mountPath: /var/run/runtime.sock
        - name: kube-api-access-cghb7
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        privileged: true
        runAsUser: 0
  containers:
    - name: environment-shell-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6-0cde0a
      image: >-
        10.100.0.2/orchest-env-8495ece7-ec3a-4581-aa73-999e38b27c63-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6:8
      command:
        - /orchest/bootscript.sh
      args:
        - shell
      ports:
        - containerPort: 22
          protocol: TCP
      env:
        - name: ORCHEST_PROJECT_UUID
          value: 8495ece7-ec3a-4581-aa73-999e38b27c63
        - name: ORCHEST_PIPELINE_UUID
          value: fb5467fc-7db6-41c2-bcc1-941631f90b3a
        - name: ORCHEST_PIPELINE_PATH
          value: /pipeline.json
        - name: ORCHEST_SESSION_UUID
          value: 8495ece7-ec3a-4581fb5467fc-7db6-41c2
        - name: ORCHEST_SESSION_TYPE
          value: interactive
      resources:
        requests:
          cpu: 1m
      volumeMounts:
        - name: userdir-pvc
          mountPath: /project-dir
          subPath: projects/demo-advanced-elections-first-round
        - name: userdir-pvc
          mountPath: /data
          subPath: data
        - name: userdir-pvc
          mountPath: /pipeline.json
          subPath: projects/demo-advanced-elections-first-round/main.orchest
        - name: kube-api-access-cghb7
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      startupProbe:
        exec:
          command:
            - echo
            - '1'
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
  restartPolicy: Always
  terminationGracePeriodSeconds: 5
  dnsPolicy: ClusterFirst
  serviceAccountName: default
  serviceAccount: default
  securityContext:
    runAsUser: 0
    runAsGroup: 1
    fsGroup: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - ip-13-0-1-185.ec2.internal
  schedulerName: default-scheduler
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  dnsConfig:
    options:
      - name: timeout
        value: '10'
      - name: attempts
        value: '5'
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2022-12-01T14:25:48Z'
      reason: Unschedulable
      message: >-
        0/1 nodes are available: 1 node(s) didn't match Pod's node
        affinity/selector.
  qosClass: Burstable
👀 1
This?
j

Jacopo

12/01/2022, 2:38 PM
the deployment one would be better
kubectl -n orchest describe deployment
a

Alexsander Pereira

12/01/2022, 2:39 PM
Copy code
kind: Deployment
apiVersion: apps/v1
metadata:
  name: environment-shell-f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8adc2c
  namespace: orchest
  uid: 703f3c79-b919-4e2d-a9c5-03bbe242189f
  resourceVersion: '7787664'
  generation: 1
  creationTimestamp: '2022-12-01T13:30:47Z'
  labels:
    app: environment-shell
    project_uuid: dfac15de-be36-4521-a7f7-4869923bccc3
    session_uuid: dfac15de-be36-452174c195b0-910d-4aab
    shell_uuid: f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8adc2c
  annotations:
    deployment.kubernetes.io/revision: '1'
  managedFields:
    - manager: OpenAPI-Generator
      operation: Update
      apiVersion: apps/v1
      time: '2022-12-01T13:30:47Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:project_uuid: {}
            f:session_uuid: {}
            f:shell_uuid: {}
        f:spec:
          f:progressDeadlineSeconds: {}
          f:replicas: {}
          f:revisionHistoryLimit: {}
          f:selector: {}
          f:strategy:
            f:rollingUpdate:
              .: {}
              f:maxSurge: {}
              f:maxUnavailable: {}
            f:type: {}
          f:template:
            f:metadata:
              f:labels:
                .: {}
                f:app: {}
                f:project_uuid: {}
                f:session_uuid: {}
                f:shell_uuid: {}
              f:name: {}
            f:spec:
              f:affinity:
                .: {}
                f:nodeAffinity:
                  .: {}
                  f:requiredDuringSchedulingIgnoredDuringExecution: {}
              f:containers:
                k:{"name":"environment-shell-f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8adc2c"}:
                  .: {}
                  f:args: {}
                  f:command: {}
                  f:env:
                    .: {}
                    k:{"name":"ORCHEST_PIPELINE_PATH"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"ORCHEST_PIPELINE_UUID"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"ORCHEST_PROJECT_UUID"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"ORCHEST_SESSION_TYPE"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"ORCHEST_SESSION_UUID"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:ports:
                    .: {}
                    k:{"containerPort":22,"protocol":"TCP"}:
                      .: {}
                      f:containerPort: {}
                      f:protocol: {}
                  f:resources:
                    .: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                  f:startupProbe:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                    f:failureThreshold: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/data"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                      f:subPath: {}
                    k:{"mountPath":"/pipeline.json"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                      f:subPath: {}
                    k:{"mountPath":"/project-dir"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                      f:subPath: {}
              f:dnsConfig:
                .: {}
                f:options: {}
              f:dnsPolicy: {}
              f:initContainers:
                .: {}
                k:{"name":"image-puller"}:
                  .: {}
                  f:command: {}
                  f:env:
                    .: {}
                    k:{"name":"CONTAINER_RUNTIME"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"IMAGE_TO_PULL"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:resources: {}
                  f:securityContext:
                    .: {}
                    f:privileged: {}
                    f:runAsUser: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/var/run/runtime.sock"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
              f:restartPolicy: {}
              f:schedulerName: {}
              f:securityContext:
                .: {}
                f:fsGroup: {}
                f:runAsGroup: {}
                f:runAsUser: {}
              f:terminationGracePeriodSeconds: {}
              f:volumes:
                .: {}
                k:{"name":"container-runtime-socket"}:
                  .: {}
                  f:hostPath:
                    .: {}
                    f:path: {}
                    f:type: {}
                  f:name: {}
                k:{"name":"userdir-pvc"}:
                  .: {}
                  f:name: {}
                  f:persistentVolumeClaim:
                    .: {}
                    f:claimName: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: apps/v1
      time: '2022-12-01T13:30:47Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:deployment.kubernetes.io/revision: {}
        f:status:
          f:conditions:
            .: {}
            k:{"type":"Available"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Progressing"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:observedGeneration: {}
          f:replicas: {}
          f:unavailableReplicas: {}
          f:updatedReplicas: {}
      subresource: status
spec:
  replicas: 1
  selector:
    matchLabels:
      app: environment-shell
      project_uuid: dfac15de-be36-4521-a7f7-4869923bccc3
      session_uuid: dfac15de-be36-452174c195b0-910d-4aab
      shell_uuid: f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8adc2c
  template:
    metadata:
      name: environment-shell-f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8adc2c
      creationTimestamp: null
      labels:
        app: environment-shell
        project_uuid: dfac15de-be36-4521-a7f7-4869923bccc3
        session_uuid: dfac15de-be36-452174c195b0-910d-4aab
        shell_uuid: f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8adc2c
    spec:
      volumes:
        - name: userdir-pvc
          persistentVolumeClaim:
            claimName: userdir-pvc
        - name: container-runtime-socket
          hostPath:
            path: /var/run/docker.sock
            type: Socket
      initContainers:
        - name: image-puller
          image: orchest/image-puller:v2022.10.5
          command:
            - /pull_image.sh
          env:
            - name: IMAGE_TO_PULL
              value: >-
                10.100.0.2/orchest-env-dfac15de-be36-4521-a7f7-4869923bccc3-f1929b0d-c0b3-4153-a67e-474e3e9c8b61:1
            - name: CONTAINER_RUNTIME
              value: docker
          resources: {}
          volumeMounts:
            - name: container-runtime-socket
              mountPath: /var/run/runtime.sock
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            runAsUser: 0
      containers:
        - name: environment-shell-f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8adc2c
          image: >-
            10.100.0.2/orchest-env-dfac15de-be36-4521-a7f7-4869923bccc3-f1929b0d-c0b3-4153-a67e-474e3e9c8b61:1
          command:
            - /orchest/bootscript.sh
          args:
            - shell
          ports:
            - containerPort: 22
              protocol: TCP
          env:
            - name: ORCHEST_PROJECT_UUID
              value: dfac15de-be36-4521-a7f7-4869923bccc3
            - name: ORCHEST_PIPELINE_UUID
              value: 74c195b0-910d-4aab-a7fb-fdd5aa975804
            - name: ORCHEST_PIPELINE_PATH
              value: /pipeline.json
            - name: ORCHEST_SESSION_UUID
              value: dfac15de-be36-452174c195b0-910d-4aab
            - name: ORCHEST_SESSION_TYPE
              value: interactive
          resources:
            requests:
              cpu: 1m
          volumeMounts:
            - name: userdir-pvc
              mountPath: /project-dir
              subPath: projects/dev-demo-advanced-retail
            - name: userdir-pvc
              mountPath: /data
              subPath: data
            - name: userdir-pvc
              mountPath: /pipeline.json
              subPath: projects/dev-demo-advanced-retail/demo_amazon_retail_new.orchest
          startupProbe:
            exec:
              command:
                - echo
                - '1'
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 5
      dnsPolicy: ClusterFirst
      securityContext:
        runAsUser: 0
        runAsGroup: 1
        fsGroup: 1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchFields:
                  - key: metadata.name
                    operator: In
                    values:
                      - ip-13-0-1-185.ec2.internal
      schedulerName: default-scheduler
      dnsConfig:
        options:
          - name: timeout
            value: '10'
          - name: attempts
            value: '5'
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 1
  replicas: 1
  updatedReplicas: 1
  unavailableReplicas: 1
  conditions:
    - type: Available
      status: 'False'
      lastUpdateTime: '2022-12-01T13:30:47Z'
      lastTransitionTime: '2022-12-01T13:30:47Z'
      reason: MinimumReplicasUnavailable
      message: Deployment does not have minimum availability.
    - type: Progressing
      status: 'False'
      lastUpdateTime: '2022-12-01T13:40:48Z'
      lastTransitionTime: '2022-12-01T13:40:48Z'
      reason: ProgressDeadlineExceeded
      message: >-
        ReplicaSet
        "environment-shell-f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8adc2c-898bd56bf"
        has timed out progressing.
👀 1
j

Jacopo

12/01/2022, 2:44 PM
kubectl get nodes?
a

Alexsander Pereira

12/01/2022, 2:47 PM
image.png
Would that be the problem?
image.png
j

Jacopo

12/01/2022, 2:49 PM
Yes, that's the issue. I believe I know what the reason is, give me a sec
a

Alexsander Pereira

12/01/2022, 2:49 PM
Orchest is trying to start the pods on a node that does not exist in the cluster
j

Jacopo

12/01/2022, 2:57 PM
I think this is part of a current limitation that we failed to document, give me some minutes to discuss things internally
a

Alexsander Pereira

12/01/2022, 3:05 PM
Ok, would this be related to the mounting issues as well?
j

Jacopo

12/01/2022, 3:05 PM
I don't believe the mounting issue is related
a

Alexsander Pereira

12/01/2022, 3:11 PM
Some environment shells also have the mount issue: Unable to attach or mount volumes: unmounted volumes=[userdir-pvc], unattached volumes=[container-runtime-socket kube-api-access-lwrbt userdir-pvc]: timed out waiting for the condition
image.png
j

Jacopo

12/01/2022, 3:12 PM
The reason the environment shell is failing is that Orchest doesn't currently support cycling out nodes of the cluster, i.e. it works under the assumption that once a node is in, it will stay part of the cluster. The reason is pretty banal: it's part of an overarching effort to support multi-node clusters and we are still working on this. I expected to be working on this particular thing 1 or 2 weeks from now, but I'll try to prioritize it for tomorrow or next week and make a release. I'll help with a temporary solution that you can use to fix the situation (it needs to be reapplied if you cycle out nodes again); it basically just involves manipulating the Orchest DB state. Start with this
Copy code
kubectl exec -it -n orchest deploy/orchest-database -- psql -U postgres -d orchest_api
then run
select * from cluster_nodes;
Here we are interested in the nodes that are actually in the cluster versus the nodes that are not there anymore. We'll need to manipulate Orchest state so that Orchest doesn't schedule anything on the missing nodes anymore
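To compare the two views side by side without an interactive psql session, something like this should work (same credentials as above):
kubectl get nodes -o name
kubectl exec -n orchest deploy/orchest-database -- \
  psql -U postgres -d orchest_api -c 'select name from cluster_nodes;'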
a

Alexsander Pereira

12/01/2022, 3:14 PM
image.png
j

Jacopo

12/01/2022, 3:15 PM
With kubectl get nodes you should find out which nodes we want to remove
a

Alexsander Pereira

12/01/2022, 3:15 PM
Okay, I already know what they are
j

Jacopo

12/01/2022, 3:16 PM
then please run
select * from environment_images where not stored_in_registry;
(it's a safety check)
a

Alexsander Pereira

12/01/2022, 3:16 PM
image.png
j

Jacopo

12/01/2022, 3:16 PM
that's good 🙂
alright, give me a sec to come up with another query that's just a safety check. In the meantime, please back up the content of the following tables: cluster_nodes, environment_image_on_nodes, jupyter_image_on_nodes
a

Alexsander Pereira

12/01/2022, 3:20 PM
How can I make this backup? I would do it through DBeaver, but how could I connect to this database externally?
j

Jacopo

12/01/2022, 3:21 PM
Please run the following
Copy code
select node_name, count(*) from environment_image_on_nodes group by node_name;

select node_name, count(*) from jupyter_image_on_nodes group by node_name;
Here we want to check whether the newer nodes have >= images compared to the ones that have been cycled out.
About the backups: all the commands involved are pretty much just Postgres management commands run through kubectl exec, i.e.
kubectl exec -it -n orchest deploy/orchest-database -- <your command>
To create backups of the database and/or specific tables you can run pg_dump with the appropriate options:
kubectl exec -it -n orchest deploy/orchest-database -- pg_dump <rest of flags>
a

Alexsander Pereira

12/01/2022, 3:23 PM
image.png
Copy code
kubectl exec -it -n orchest deploy/orchest-database -- pg_dump -U postgres -d orchest_api -t cluster_nodes > cluster_nodes.sql

kubectl exec -it -n orchest deploy/orchest-database -- pg_dump -U postgres -d orchest_api -t environment_image_on_nodes > environment_image_on_nodes.sql

kubectl exec -it -n orchest deploy/orchest-database -- pg_dump -U postgres -d orchest_api -t jupyter_image_on_nodes > jupyter_image_on_nodes.sql
Would it be like this?
j

Jacopo

12/01/2022, 3:31 PM
Yeah, it looks like something like that should work. Note that the produced file is going to be stored in the container; you will need to use kubectl cp to get it out to your machine.
Please run
Copy code
SELECT node_name,
       count(*)
FROM environment_image_on_nodes ein
JOIN environment_images ei ON ein.project_uuid=ei.project_uuid
AND ein.environment_uuid=ei.environment_uuid
AND ein.environment_image_tag = ei.tag
WHERE NOT ei.marked_for_removal
GROUP BY node_name;
a

Alexsander Pereira

12/01/2022, 3:32 PM
It already came to my machine with that command
j

Jacopo

12/01/2022, 3:32 PM
oh 🤔 , well that's better 😛
a

Alexsander Pereira

12/01/2022, 3:32 PM
image.png
j

Jacopo

12/01/2022, 3:35 PM
Alright, given these queries (and assuming you backed up those 3 tables) I think we can proceed with deleting the nodes that are not in the cluster and trying to spin up the session again:
delete from cluster_nodes where name = '<your_node_that_does_not_exist_anymore>';
I guess you should be deleting 2 out of the 3 nodes, according to your kubectl get nodes
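The same delete can be run non-interactively; the node name here is just the stale one from the affinity pasted above, so substitute whichever names no longer appear in kubectl get nodes:
kubectl exec -n orchest deploy/orchest-database -- \
  psql -U postgres -d orchest_api \
  -c "delete from cluster_nodes where name = 'ip-13-0-1-185.ec2.internal';"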
a

Alexsander Pereira

12/01/2022, 3:38 PM
image.png
image.png
j

Jacopo

12/01/2022, 3:39 PM
Alright, stop the session and try to spin up a new one
a

Alexsander Pereira

12/01/2022, 3:41 PM
here?
j

Jacopo

12/01/2022, 3:41 PM
yes
a

Alexsander Pereira

12/01/2022, 3:42 PM
image.png
Frontend is very unstable
It's not even loading the project files.
image.png
j

Jacopo

12/01/2022, 3:44 PM
Any chance you have to log in again?
Or is the node on which Orchest is running under intense load?
a

Alexsander Pereira

12/01/2022, 3:45 PM
image.png
It's quiet, but even after logging in again, it's still very slow...
j

Jacopo

12/01/2022, 3:46 PM
🤔 can't really help you with that sadly, looks more like networking stuff, we run 2 vCPU instances routinely
a

Alexsander Pereira

12/01/2022, 3:47 PM
😕
image.png
j

Jacopo

12/01/2022, 3:49 PM
What's the response of the API call?
a

Alexsander Pereira

12/01/2022, 3:49 PM
But the error is coming from Orchest itself, error 500
Internal Server Error
Only Internal Server Error
j

Jacopo

12/01/2022, 3:50 PM
can you try to start the session again and take a look at the logs of the orchest-api?
kubectl logs --follow -n orchest deploy/orchest-api
a

Alexsander Pereira

12/01/2022, 3:52 PM
logs-from-orchest-api-in-orchest-api-74dbf8f9cf-jcr7v.log
logs-from-orchest-database-in-orchest-database-549874b495-ltxmh.log
database errors
image.png
I'm restarting the database
j

Jacopo

12/01/2022, 3:59 PM
I'm also seeing exceptions on `GET`s in there, e.g.
Exception on /api/sessions/ [GET]
along with other stuff that makes it look like there is something wrong 🤔, might be leftovers from the sessions trying to start
Can you try to restart Orchest? If not, can you stop all sessions that are still trying to start?
a

Alexsander Pereira

12/01/2022, 4:00 PM
How can I do this restart?
Through the frontend itself?
j

Jacopo

12/01/2022, 4:01 PM
Either through the CLI or through the settings -> restart
yeah
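For the CLI route, presumably something like the following, assuming the orchest-cli is installed on a machine whose kubectl context points at the cluster:
orchest restart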
a

Alexsander Pereira

12/01/2022, 4:02 PM
image.png
I believe it's because the database is restarting
image.png
image.png
j

Jacopo

12/01/2022, 4:03 PM
yeah, no DB no application eheh 😛
the restart of Orchest should clean up all existing sessions
so I'm curious if that will solve the issue
a

Alexsander Pereira

12/01/2022, 4:03 PM
hahaha
I hope the database mounting issue doesn't happen again, because that was related before
😧
What I feared happened
j

Jacopo

12/01/2022, 4:08 PM
kubectl get volumeattachments?
I'm wondering if there are any stale attachments not allowing things to move on
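Inspecting (and, if clearly stale, removing) an attachment would look roughly like this; the attachment name is a placeholder and deleting one is only safe if it references a node that no longer exists:
kubectl get volumeattachments
kubectl describe volumeattachment <attachment-name>
# only if the attachment points at a node that is gone:
kubectl delete volumeattachment <attachment-name>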
a

Alexsander Pereira

12/01/2022, 4:08 PM
image.png
j

Jacopo

12/01/2022, 4:09 PM
I see
What's the state of the volume? And the PVC?
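For reference, the quickest way to see both states (namespace and claim name taken from earlier in the thread):
kubectl -n orchest get pvc userdir-pvc
kubectl get pv
kubectl -n orchest describe pvc userdir-pvc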
a

Alexsander Pereira

12/01/2022, 4:10 PM
here?
image.png
Bound, I think
In df -h I see the EBS volumes
j

Jacopo

12/01/2022, 4:16 PM
Mmh, this really doesn't ring a bell 🤔, doesn't look like an Orchest issue, but I'd hate to be wrong. Anything interesting in kubectl get events?
a

Alexsander Pereira

12/01/2022, 4:17 PM
image.png
I've been trying to figure out this issue since yesterday, but without a solution; this is preventing us from continuing with Orchest in production haha
I'm running Orchest on EKS myself for development purposes and haven't had issues like this so far, admittedly the instance doesn't have real life workloads tho
a

Alexsander Pereira

12/01/2022, 4:21 PM
Yes, but Orchest is using gp2
And the AWS link, I had already seen it, but I still don't know what it is
image.png
j

Jacopo

12/01/2022, 4:23 PM
Is this happening with only this node or has it been happening for different nodes? Have you tried restarting the node? I'm wondering if something went wrong in the lower part of the stack
a

Alexsander Pereira

12/01/2022, 4:24 PM
Copy code
Name:          userdir-pvc
Namespace:     orchest
StorageClass:  gp2
Status:        Bound
Volume:        pvc-7efa04de-55f2-4b7d-9a2a-8588152e1f4f
Labels:        controller.orchest.io/component=userdir-pvc
               controller.orchest.io/owner=cluster-1
               controller.orchest.io/part-of=orchest
               orchest.io/orchest-hash=3
Annotations:   controller.orchest.io/deploy-ingress: false
               controller.orchest.io/k8s: eks
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
               volume.kubernetes.io/selected-node: ip-13-0-1-113.ec2.internal
               volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      150Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       celery-worker-7db6d7c6d8-68qbk
               data-app-f6690730-6896-4e7fb6c1ef3f-da79-4486-6bdb74b77d-v4gx8
               environment-shell-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6-0cdvr98s
               environment-shell-268bec2c-0a2f-4b5a-ab69-4df9d5e514c6-eaaj4sb5
               environment-shell-3099aa6b-0e6b-4c51-954c-5f6fefbf8fe3-201kbx9w
               environment-shell-3099aa6b-0e6b-4c51-954c-5f6fefbf8fe3-3e5p2b9r
               environment-shell-9c96381f-d12c-46bd-9214-7021b33984eb-169s8kr4
               environment-shell-b6b9af33-1531-44c1-bc66-3b9bf7999d29-c49ghzqf
               environment-shell-c56ab762-539c-4cce-9b1e-c4b00300ec6f-2da7kw5v
               environment-shell-de413fa4-dbc7-4d6a-9008-ca5faf512fe7-1ea2bkmn
               environment-shell-f1929b0d-c0b3-4153-a67e-474e3e9c8b61-8addlfkb
               fast-api-f6690730-6896-4e7fb6c1ef3f-da79-4486-5866f95f46-hvb56
               jupyter-eg-3113449b-3d31-442b97ef5146-99d7-4e82-8467cd467dwkmnh
               jupyter-eg-4634f845-a98d-482311e7b51d-d877-48e5-7fb7c875df6grv7
               jupyter-eg-8495ece7-ec3a-45819c759df9-278e-48ab-7b7d478d4d29rmq
               jupyter-eg-8495ece7-ec3a-4581fb5467fc-7db6-41c2-57b797c58b5lwrf
               jupyter-eg-85414ec5-acc7-491c17ff1121-e5ac-4b5c-6bc8778bc7nsb4q
               jupyter-eg-8647bf67-0b9c-4d5a07a9d9e0-cdfd-4944-7457954c8897l6q
               jupyter-eg-9ac69ab1-c85e-41ff0915b350-b929-4cbd-74bff56dfcvvrds
               jupyter-eg-dfac15de-be36-452174c195b0-910d-4aab-6d465db65c8p5vl
               jupyter-eg-f6690730-6896-4e7f8b1d2370-a3e1-44a3-f7d5697c8-d5vwk
               jupyter-eg-f6690730-6896-4e7fb6c1ef3f-da79-4486-5f784d6bb4rmsct
               jupyter-server-3113449b-3d31-442b97ef5146-99d7-4e82-bf7df9bzn9d
               jupyter-server-4634f845-a98d-482311e7b51d-d877-48e5-9bdfd59pgv6
               jupyter-server-8495ece7-ec3a-45819c759df9-278e-48ab-548cd7knql9
               jupyter-server-8495ece7-ec3a-4581fb5467fc-7db6-41c2-5df457qcn5q
               jupyter-server-85414ec5-acc7-491c17ff1121-e5ac-4b5c-7b7d846qj47
               jupyter-server-8647bf67-0b9c-4d5a07a9d9e0-cdfd-4944-7f487f5m9f2
               jupyter-server-9ac69ab1-c85e-41ff0915b350-b929-4cbd-5dccf67n7w7
               jupyter-server-dfac15de-be36-452174c195b0-910d-4aab-54dd67mkwbx
               jupyter-server-f6690730-6896-4e7f8b1d2370-a3e1-44a3-84bd4865n6b
               jupyter-server-f6690730-6896-4e7fb6c1ef3f-da79-4486-647d78s42mh
               orchest-api-74dbf8f9cf-jcr7v
               orchest-database-68676bc697-xws94
               orchest-webserver-7db6586774-qcq8q
               rabbitmq-server-57b9fd578c-5q2dg
               session-sidecar-f6690730-6896-4e7fb6c1ef3f-da79-4486-569f55dzjr
Events:        <none>
I already tried to restart the node and I tried to recreate the node; that's what caused the problem of sessions being pinned to nodes that no longer exist.
But without success!
I think my last alternative would be to recreate the cluster, reinstall Orchest and upload the projects again.
But without knowing the cause, I'm afraid it will happen again in customer environments.
j

Jacopo

12/01/2022, 4:27 PM
Then my suggestion would be to delete all jupyter-server, jupyter-eg and environment-shell deployments and see if that fixes it. Perhaps something has gone wrong at the lower level of the stack, e.g. kubelet or containerd, and the volume is in limbo
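As a sketch, assuming the jupyter pods carry similar app labels (worth checking with --show-labels first; only app=environment-shell is confirmed by the manifests above):
kubectl -n orchest get deployments --show-labels
kubectl -n orchest delete deployments -l app=environment-shell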
a

Alexsander Pereira

12/01/2022, 4:28 PM
I've done this before; that's how I managed to make the database, webserver and other services come back.
But the problem comes back when starting the Jupyter servers and environment shells.
And if you restart some Orchest container, they don't come back up anymore.
j

Jacopo

12/01/2022, 4:29 PM
I see, but now we have fixed Orchest state by deleting some nodes from cluster_nodes, so the issue with environment shells shouldn't be there anymore.
Basically I'm wondering if the volume is stuck because these pods are mounting it while being "scheduled" to a node that doesn't exist anymore through affinities
a

Alexsander Pereira

12/01/2022, 4:36 PM
I don't think so, this problem started even before we recreated the node
j

Jacopo

12/01/2022, 4:38 PM
I see, then I'm really out of ideas here sadly
a

Alexsander Pereira

12/01/2022, 4:39 PM
One question: is there any way to use a userdir volume that is not a dedicated EBS volume?
Couldn't I use the node's own file system?
r

Rafael Rodrigues Santana

12/01/2022, 4:39 PM
An NFS volume, for example...
a

Alexsander Pereira

12/01/2022, 4:40 PM
NFS or the node's file system
Because I already have a disk attached to the node that would be enough; I don't know why Orchest creates this additional disk and mounts it with the EBS CSI driver.
Couldn't the userdir be on the disk of the node itself?
j

Jacopo

12/01/2022, 4:50 PM
In the orchest.io/v1alpha1/orchestclusters CRD (e.g. kubectl -n orchest get orchestclusters etc.) you can find what behavior can be parameterized
One question, is there any way to use a userdir volume that is not an isolated EBS?
Couldn't I use the node's own file system?
I don't know why Orchest creates this additional disk and uses mount with EBS CSI
Currently, the storage class that is used for the volumes is the default storage class of the cluster, in this case EBS. About NFS and more specific/flexible support for storage, we need to chat internally about a couple of points before we can come back with a reply @Yannick @Rick Lamers
👀 1
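To see which storage class the cluster treats as default (and what the userdir PVC ended up with), something like:
kubectl get storageclass
kubectl -n orchest get pvc userdir-pvc -o jsonpath='{.spec.storageClassName}'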
a

Alexsander Pereira

12/01/2022, 5:08 PM
But would we be able to use a folder on the node itself, for example?
What do you recommend: recreating the cluster?
y

Yannick

12/01/2022, 5:40 PM
Given the extensive conversation here I propose scheduling a call for early next week (let's go for Monday so that you can be productive again ASAP). That gives us enough time to discuss what we can do internally. Would that work for you?
a

Alexsander Pereira

12/01/2022, 6:02 PM
Rick, we really need to solve this issue of mounting volumes. Then we can discuss possible alternatives to the use of EBS.
Why are some pods able to mount userdir-pvc while others are not? We need to fix this 😕
@Yannick @Jacopo I opened a support ticket with AWS too, hope they can support us as well.
r

Rick Lamers

12/01/2022, 10:44 PM
I agree with @Yannick that the best path forward seems to be jumping on a call where we can discuss in real time 1) what you’re trying to accomplish and under which constraints, 2) what kind of errors you’re getting and what various changes yield us, and 3) what we can change/add to the OSS to better support your intended setup. You can DM me for a Calendly link and we can get something on the calendar for next week.
r

Rafael Rodrigues Santana

12/02/2022, 2:36 PM
Hi guys, yesterday we were able to solve this issue. At first we suspected there was a limitation on the number of mounts EBS supports, but it seems the limitation is only about disk attachments (something near 25 EBS volumes per instance). After we ran a script sent by AWS EKS support, we started to suspect that the problem was that the disk was too slow and, because of that, the disk mounts were timing out. We started to eliminate files on that disk and discovered that one of our users had created something like a million files from a dataset he had downloaded and unpacked. After deleting those files, the disk became a lot faster and the problem with the mounts stopped. Our deployment is now stable. Thank you very much for all the support you provided yesterday!
🚀 2
🙌 2
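For anyone hitting the same symptom, a rough way to spot runaway file counts on the userdir volume; the mount path is an assumption and depends on where the volume is mounted (e.g. inside one of the pods that uses it):
# count inodes per top-level directory (GNU coreutils)
du --inodes -d 1 /path/to/userdir | sort -rn | head
# or simply count all files
find /path/to/userdir -xdev -type f | wc -l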
j

Jacopo

12/02/2022, 2:37 PM
We started to eliminate files on that disk and discovered that one of our users had created something like a million files from a dataset he had downloaded and unpacked.
That's super interesting to know, will keep an eye out for this, thanks for the update, glad things are solved 🙂
a

Alexsander Pereira

12/02/2022, 2:40 PM
I believe it is not a common use case to have more than a million files on an EBS volume. However, I would like to raise two points: 1 - Would it be possible for Orchest to use a userdir-pvc volume of type io1 or io2 (provisioned IOPS SSDs)? I believe that in case of slowness, increasing the amount of IOPS would solve the problem. 2 - We identified that the Postgres data is on the same volume as the projects. I already raised this situation before, but what would need to be done to use an external Postgres on RDS? We are available to make a contribution to the project if you approve. Would you be open to that?
@Jacopo If another customer goes through this, now you have the possible cause mapped out haha
j

Jacopo

12/02/2022, 2:41 PM
Indeed ahah 🙂
r

Rick Lamers

12/02/2022, 2:45 PM
@Alexsander Pereira the userdir-pvc will be able to use EFS independently of the other EBS volumes in an update we're releasing to the OSS soon. Does that help?
a

Alexsander Pereira

12/02/2022, 3:08 PM
@Rick Lamers EFS is slower than EBS, ideally EBS with provisioned IOPS
r

Rick Lamers

12/02/2022, 3:18 PM
For userdir-pvc in a multi-node setup we need a volume that can be attached to multiple nodes. That’s not possible with EBS volumes
🤔 1
a

Alexsander Pereira

12/02/2022, 3:22 PM
We don't need to have multiple nodes, it can be only one. But wouldn't it be possible to parameterize the EBS type from gp2 to io1, since Kubernetes allows this?
j

Jacopo

12/02/2022, 3:23 PM
It will most likely be possible but it's something we are still actively working on so specifics and limitations aren't set in stone yet
a

Alexsander Pereira

12/02/2022, 3:39 PM
I see, let's map this limitation then. About RDS, any space for us to contribute?
About EBS with provisioned IOPS, wouldn't it just be a matter of setting the StorageClass in the cluster manifest?
image.png
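A sketch of what such a StorageClass could look like with the EBS CSI driver (the name and iopsPerGB value are examples; whether Orchest's controller lets the userdir-pvc target a non-default class, or whether you'd mark this class as the cluster default, was the open question above):
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-io1
  # uncomment to make it the cluster default:
  # annotations:
  #   storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: io1
  iopsPerGB: "50"
EOF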
r

Rick Lamers

12/02/2022, 5:09 PM
About EBS with provisioned IOPS, wouldn't it just be a matter of setting the StorageClass in the cluster manifest?
I don't see why in a single-node setup this wouldn't work indeed
I see, let's map this limitation then. About RDS, any space for us to contribute?
We'd welcome a PR that makes the database configurable. Something like "use a postgres container managed by the orchest-controller (default, what it is now) or accept RDS credentials and use that"
a

Alexsander Pereira

12/02/2022, 6:17 PM
@Rick Lamers About EBS, where would I set the amount of IOPS, given that the StorageClassName is already specified in the manifest? Can you confirm that this actually works?
@Rick Lamers I'll talk to the team and let you know if it's possible for us to contribute. Can we have support if we have any questions? It would be the first contribution from Dadosfera.
r

Rick Lamers

12/06/2022, 10:34 AM
Hi @Alexsander Pereira, missed these. I'm not sure whether IOPS would be configured at the k8s level, sounds like that would be in AWS itself.
can we have support if we have any questions?
Glad to answer specific questions about how things work in the current implementation to help you navigate implementing the RDS support 👍