
Assistance with Flyte Spark tolerations

Summary

A user sought help applying tolerations to the driver/executor pods launched by the Flyte Spark plugin: despite an apparently correct configuration, the tolerations appeared on all other Flyte task pods but not on the Spark ones. They shared the relevant part of their flyte-backend configuration. The discussion noted that k8s default-tolerations may not reach Spark pods and suggested setting tolerations via plugins.spark.spark-config-default instead, while observing that the Spark operator Helm chart only exposes tolerations for the controller (per the operator API documentation), indicating potential limitations in configuring the driver/executor pods. The user later reported success after properly configuring the mutating webhook for the Spark Operator service and aligning the namespaces/service accounts used by Flyte with those watched by the webhook.

Status: open
Tags: #ask-the-community

      kumare

      9/19/2024

      We should upstream this to Spark

      kumare

      9/19/2024

      Ohh yes, but this is because the Spark driver kicks off pods, and the driver code is old and does not use pod specs, sadly

      josh.wills

      9/19/2024

      Yo, just wanted to report back that I finally got this to work by properly configuring the mutating webhook for the Spark Operator service: https://www.kubeflow.org/docs/components/spark-operator/getting-started/#about-the-mutating-admission-webhook -- you just need to make sure it's set up on your Spark Operator, and that you have alignment between the namespaces/service accounts that Flyte is using and those that the mutating webhook is watching.
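
      A minimal values sketch of that setup, assuming the kubeflow/spark-operator Helm chart (key names differ across chart versions, and the job namespace below is hypothetical, so check the values of the chart you actually deploy):

        # Illustrative values for the spark-operator Helm chart
        webhook:
          enable: true  # turn on the mutating admission webhook
        # The operator (and its webhook) must watch the namespace that
        # Flyte launches Spark pods into; the namespace below is a
        # hypothetical example, and newer chart versions use a list
        # key (e.g. sparkJobNamespaces) instead.
        sparkJobNamespace: flyte-development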

      josh.wills

      9/18/2024

      Hey <@U04H6UUE78B>! I think my fallback plan is to use the spark.kubernetes.{driver/executor}.podTemplateFile property in those Spark configs to create a pod template that includes the tolerations. I was just surprised b/c it looked like the Flyte spark.go code was using those k8s.default-tolerations (and the other pod settings under k8s) to set up the default pod spec that gets passed in to the createSparkPodSpec function, e.g. here: https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/k8s/spark/spark.go#L177
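
      For reference, a sketch of that fallback: a pod template carrying the tolerations (the file path is hypothetical and would need to be baked into, or mounted on, the Spark image):

        # pod-template.yaml (hypothetical path inside the Spark image)
        apiVersion: v1
        kind: Pod
        spec:
          tolerations:
          - key: datology-job-type
            operator: Exists
            effect: NoSchedule

      referenced from spark-config-default via the standard Spark properties:

        - spark.kubernetes.driver.podTemplateFile: /opt/spark/conf/pod-template.yaml
        - spark.kubernetes.executor.podTemplateFile: /opt/spark/conf/pod-template.yaml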

      kumare

      9/18/2024

      but, this might be something to look into

      kumare

      9/18/2024

      Aah, the joys of the Spark operator

      josh.wills

      9/17/2024

      The tolerations show up on all of my flyte task pods except for the pods that get launched via the SparkOperator

      josh.wills

      9/17/2024

      Relevant section of the config for the flyte-backend looks like this:

        k8s:
          default-env-vars:
          - AWS_METADATA_SERVICE_TIMEOUT: 5
          - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
          default-tolerations:
          - effect: NoSchedule
            key: datology-job-type
            operator: Exists
          inject-finalizer: true
        spark:
          spark-config-default:
          - spark.eventLog.enabled: "true"
          - spark.eventLog.dir: s3a://dev-datologyai-job-logs/dev-next-spark-operator-logs
          - spark.eventLog.rolling.enabled: "true"
          - spark.eventLog.rolling.maxFileSize: 16m
          - spark.kubernetes.authenticate.submission.caCertFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          - spark.kubernetes.authenticate.submission.oauthTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
          - spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
          - spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
          - spark.driver.extraJavaOptions: -Divy.cache.dir=/tmp -Divy.home=/tmp
      storage:
        cache:
          max_size_mbs: 100
          target_gc_percent: 100
      
      josh.wills

      9/17/2024

      Hey all, I'm trying to propagate some tolerations to the driver/executor pods that get launched via the Flyte Spark plugin, and I must be missing something about how this works; the relevant section of my configuration is in the :thread:, and I think I'm reading the relevant bits of the Spark plugin (https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/k8s/spark/spark.go#L144) correctly, but for whatever reason my tolerations aren't making the leap from the configuration to the pods. Any help from folks who have figured this out before would be very much appreciated! :bow:
