Summary
The user is seeking help applying tolerations to the driver/executor pods launched by the Flyte Spark plugin: despite a configuration they believe is correct, the tolerations are not being applied. They share the relevant part of their `flyte-backend` configuration and note that the tolerations do appear on all other Flyte task pods. The thread suggests that `default-tolerations` may not work for Spark pods and recommends using `plugins.spark.spark-config-default` to set tolerations instead; it also notes that the Spark operator Helm chart only exposes tolerations for the controller, referencing the operator API documentation as a possible limitation on configuring tolerations for driver/executor pods. The user ultimately reports success after properly configuring the mutating webhook for the Spark Operator service and ensuring the namespaces/service accounts used by Flyte line up with the ones the mutating webhook watches.
kumare
We should upstream this to spark
kumare
Ohh yes, but this is because the Spark driver kicks off pods and the driver code is old and does not use pod specs, sadly
josh.wills
Yo, just wanted to report back that I finally got this to work by properly configuring the mutating webhook for the Spark Operator service: https://www.kubeflow.org/docs/components/spark-operator/getting-started/#about-the-mutating-admission-webhook -- you just need to make sure it's set up on your SparkOperator + that you have alignment between the namespaces/service accounts that Flyte is using and that the mutating webhook is watching.
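A minimal sketch of the Helm values involved, assuming a recent kubeflow/spark-operator chart (key names vary between chart versions, so check the chart's own values.yaml rather than copying this verbatim): the webhook has to be enabled, and the operator has to watch the namespaces Flyte submits SparkApplications into.
```
# Illustrative values for the kubeflow/spark-operator Helm chart; key names
# differ across chart versions, so treat this as a sketch, not exact config.
webhook:
  enable: true                    # mutating admission webhook that patches
                                  # tolerations etc. onto driver/executor pods
spark:
  jobNamespaces:                  # namespaces the operator (and webhook) watch
    - flytesnacks-development     # hypothetical Flyte project-domain namespaces
    - flytesnacks-production
```
The service account Flyte runs Spark tasks under also needs to exist in those namespaces, per the alignment point above.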
josh.wills
Hey <@U04H6UUE78B>! I think my fallback plan is to use the `spark.kubernetes.{driver/executor}.podTemplateFile` property in those spark configs to create a pod template that includes the tolerations. I was just surprised b/c it looked like the Flyte `spark.go` code was using those `k8s.default-tolerations` (and the other settings for the pods under `k8s`) to set up the default podspec that was getting passed in to the `createSparkPodSpec` function from e.g. here: https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/k8s/spark/spark.go#L177
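A rough sketch of that fallback, with a hypothetical file path: the template just has to be readable by the Spark submission, and Spark 3.x merges `spec.tolerations` from it into the driver/executor pods. You would then point `spark.kubernetes.driver.podTemplateFile` and `spark.kubernetes.executor.podTemplateFile` at it in `spark-config-default`.
```
# Hypothetical pod template file, e.g. /etc/spark/templates/tolerations.yaml,
# referenced via spark.kubernetes.driver.podTemplateFile and
# spark.kubernetes.executor.podTemplateFile in spark-config-default.
apiVersion: v1
kind: Pod
spec:
  tolerations:
    - key: datology-job-type
      operator: Exists
      effect: NoSchedule
```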
kumare
but, this might be something to look into
kumare
Aah the joys of spark operator
josh.wills
The tolerations show up on all of my flyte task pods except for the pods that get launched via the SparkOperator
josh.wills
Relevant section of the config for the `flyte-backend` looks like this:
```
k8s:
  default-env-vars:
  - AWS_METADATA_SERVICE_TIMEOUT: 5
  - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
  default-tolerations:
  - effect: NoSchedule
    key: datology-job-type
    operator: Exists
  inject-finalizer: true
spark:
  spark-config-default:
  - spark.eventLog.enabled: "true"
  - spark.eventLog.dir: s3a://dev-datologyai-job-logs/dev-next-spark-operator-logs
  - spark.eventLog.rolling.enabled: "true"
  - spark.eventLog.rolling.maxFileSize: 16m
  - spark.kubernetes.authenticate.submission.caCertFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  - spark.kubernetes.authenticate.submission.oauthTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
  - spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
  - spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
  - spark.driver.extraJavaOptions: -Divy.cache.dir=/tmp -Divy.home=/tmp
storage:
  cache:
    max_size_mbs: 100
    target_gc_percent: 100
```
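For context on what the plugin produces: the tolerations derived from `k8s.default-tolerations` end up on the SparkApplication's driver/executor specs roughly as sketched below (field names from the sparkoperator.k8s.io/v1beta2 API; illustrative, not actual generated output). In many operator versions those fields only reach the actual pods when the mutating admission webhook is enabled, which is why the webhook fix reported further up the thread resolved this.
```
# Illustrative SparkApplication fragment; the operator's mutating webhook is
# what copies these tolerations onto the driver and executor pods.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-flyte-spark-task    # hypothetical name
spec:
  driver:
    tolerations:
      - key: datology-job-type
        operator: Exists
        effect: NoSchedule
  executor:
    tolerations:
      - key: datology-job-type
        operator: Exists
        effect: NoSchedule
```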
josh.wills
Hey all, I'm trying to propagate some tolerations to the driver/executor pods that get launched via the Flyte Spark plugin and I must be missing something about how this works; the relevant section of my configuration is in the :thread:, and I think I'm reading the relevant bits of the Spark plugin (https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/k8s/spark/spark.go#L144) correctly, but for whatever reason my tolerations aren't making the leap from the configuration to the pods. Any help from folks who have figured this out before would be very much appreciated! :bow: