RuntimeExecutionErrors in Flyte Deployment

Summary

The user is experiencing intermittent RuntimeExecutionErrors with their self-hosted flyte-binary deployment on Oracle Cloud, particularly due to exceeding maximum system retry attempts, which they believe occurs under higher loads. They have questions about managing retry attempts, understanding the cause of failures, the function of webhooks, and the option to disable them. Additionally, they seek information on running Flyte on OCI. The user also inquires about adding memory/CPU to the pod and is advised to adjust deployment.resources in the configuration file. They ask about adding replicas for the binary deployment but are informed that there is no leader election mechanism by default for handling multiple propeller instances in a single binary setup.

Status
resolved
Tags
    Source
    #flyte-deployment
      david.espejo

      10/31/2024

      @guyarad the leader election in propeller is not really about consistency, because the controller itself is stateless: it records execution state in etcd. The mechanism is used to ensure that, while there may be multiple replicas in the propeller deployment, only one instance is active at a time and the other(s) remain "warm" in case the leader fails. Without leader election, K8s would still recreate the propeller pod in case of a failure, but that could take a bit longer than just switching leaders. Also, without it, the propeller replicas would be competing with each other, potentially trying to update the FlyteWorkflow CRD simultaneously.

      guyarad

      10/31/2024

      @david.espejo thanks! Re: leader election - why should I care? There's a database, no?

      david.espejo

      10/17/2024

      > I didn't find where I should add memory/cpu to the pod

      You can uncomment and adjust deployment.resources to override the default resources for the Pod:

      https://github.com/flyteorg/flyte/blob/6c4f8dbfc6d23a0cd7bf81480856e9ae1dfa1b27/charts/flyte-binary/values.yaml#L235-L240

      > Can I add replicas for the binary deployment?

      You can, but there's no leader election mechanism enabled by default in the single binary to properly handle multiple propeller instances. For scaling out, flyte-core has these mechanisms available.
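
      The deployment.resources override mentioned above might look roughly like the following in a custom values file for the flyte-binary chart. This is only a sketch: the file name is hypothetical, and the CPU and memory figures are placeholder assumptions rather than recommendations, so they should be sized from observed usage.

      ```yaml
      # values-override.yaml (hypothetical file name)
      deployment:
        resources:
          requests:
            cpu: "1"       # placeholder value, not a recommendation
            memory: 2Gi    # placeholder value, not a recommendation
          limits:
            memory: 2Gi    # placeholder value, not a recommendation
      ```

      A file like this would typically be passed with -f on helm upgrade for the existing release.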

      guyarad

      10/16/2024

      @david.espejo thanks for getting back to me! I didn't find where I should add memory/cpu to the pod. I'm currently using the binary deployment, and maybe under high load it's not good enough. Can I add replicas for the binary deployment?

      david.espejo

      10/16/2024

      @guyarad running Flyte on OCI! I want to learn more 🙂

      Seems like you're hitting max-workflow-retries, which is <https://docs.flyte.org/en/latest/deployment/configuration/generated/flytepropeller_config.html#max-workflow-retries-int|set to 10> by default. There must be a good reason the worker is running out of its retry budget, and I'd suggest using the <https://grafana.com/grafana/dashboards/21719-flyte-propeller-dashboard-via-prometheus/|Grafana dashboard> to look at patterns, especially during high load, to understand it better. Adding more resources to the Pod can help, but the next question would be: how much to add? Let us know if that helps.
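
      If raising the retry budget is worth trying while the root cause is investigated, one possible way to set max-workflow-retries from a flyte-binary values file is sketched below. It assumes the chart's configuration.inline block is merged into the propeller configuration, and the value 20 is an arbitrary placeholder.

      ```yaml
      configuration:
        inline:
          propeller:
            max-workflow-retries: 20   # placeholder; the documented default is 10
      ```

      A larger budget would likely only mask the failing webhook calls for longer, so it is probably best treated as a stopgap next to the resource changes.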

      kumare

      10/10/2024

      please give it some memory

      kumare

      10/10/2024

      also this explains why it's dying

      kumare

      10/10/2024

      you did not configure it through Helm?

      guyarad

      10/9/2024

      oddly, the flyte-binary pod has no resource requests/limits configured - how's that?

      guyarad

      10/9/2024

      I didn't see anything specific with the resources: memory was high, but CPU was low

      kumare

      10/9/2024

      Also check resources etc

      kumare

      10/9/2024

      Are you using secrets? If not, you can disable the webhook

      guyarad

      10/9/2024

      Hi all, we have a self-hosted deployment of flyte-binary (on Oracle Cloud) and we started getting this error (only sometimes):

      Workflow[flyte-tasks:production:some-workflow] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: worker error(s) encountered: [0]: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": EOF

      It's possible it happens mostly under higher loads. I have a few questions:

      1. Any way to increase or manage the retry attempts?
      2. How can we understand why this actually failed? (the flyte-binary pod was overloaded?)
      3. What does the webhook actually do?
      4. If the answer to [3] is not something important - can we disable it? (I noticed the propeller.disableWebhook configuration, or maybe delete the relevant webhook resource?)

      Thanks!