Summary
The user is experiencing intermittent RuntimeExecutionErrors with their self-hosted flyte-binary deployment on Oracle Cloud, caused by exceeding the maximum number of system retry attempts, which they believe happens mostly under higher load. They have questions about managing retry attempts, understanding the cause of the failures, the function of the webhook, and the option to disable it. Additionally, they seek information on running Flyte on OCI. The user also asks how to add memory/CPU to the pod and is advised to adjust deployment.resources in the configuration file. They ask about adding replicas for the binary deployment but are informed that the single binary setup has no leader election mechanism enabled by default for handling multiple propeller instances.
david.espejo
<@U062Y21KSQG> the leader election in propeller is not really about consistency, because the controller itself is stateless; it records execution state in etcd.
The mechanism is used to ensure that, while there may be multiple replicas in the propeller deployment, only one instance is active at a time and the other(s) remain "warm" in case the leader fails.
Without leader election, K8s would still recreate the propeller pod in case of a failure but that could take a bit longer than just switching leaders. Also, the propeller replicas would be competing with each other, potentially trying to update the FlyteWorkflow CRD simultaneously.
guyarad
<@U04H6UUE78B> thanks! Re: leader-election - what do I care? there's a database no?
david.espejo
> I didn't find where I should add memory/cpu to the pod
You can uncomment and adjust deployment.resources to override the default resources for the Pod:
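A minimal sketch of what that override could look like in the flyte-binary chart's values file. The CPU and memory figures are placeholder assumptions, not recommendations; size them from what the Pod actually consumes under load.

```yaml
# values.yaml for the flyte-binary Helm chart (sketch)
deployment:
  resources:
    requests:
      cpu: "1"        # placeholder value (assumption)
      memory: 2Gi     # placeholder value (assumption)
    limits:
      memory: 4Gi     # placeholder value (assumption)
```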
> Can I add replicas for the binary deployment?
You can, but the single binary has no leader election mechanism enabled by default to properly handle multiple propeller instances. For scaling out, flyte-core has these mechanisms available.
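For illustration, leader election in a flyte-core deployment is configured in the propeller section of the chart values, roughly as sketched here. The key layout and durations are assumptions based on the flyte-core chart defaults; check them against the values.yaml of the chart version in use.

```yaml
# values.yaml for the flyte-core Helm chart (sketch; verify against your chart version)
configmap:
  core:
    propeller:
      leader-election:
        enabled: true
        lock-config-map:
          name: propeller-leader   # assumed lock name
          namespace: flyte         # assumed namespace
        lease-duration: 15s
        renew-deadline: 10s
        retry-period: 2s
```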
guyarad
<@U04H6UUE78B> thanks for getting back to me! I didn't find where I should add memory/cpu to the pod. I'm currently using the binary deployment and maybe with high load it's not good enough. Can I add replicas for the binary deployment?
david.espejo
<@U062Y21KSQG> running Flyte on OCI! I want to learn more :slightly_smiling_face:
Seems like you're hitting the max-workflow-retries limit, which is <https://docs.flyte.org/en/latest/deployment/configuration/generated/flytepropeller_config.html#max-workflow-retries-int|set to 10> by default. There must be a good reason the worker is running out of its retry budget, and I'd suggest using the <https://grafana.com/grafana/dashboards/21719-flyte-propeller-dashboard-via-prometheus/|Grafana dashboard> to look for patterns, especially during high load, to understand it better.
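For reference, a sketch of where that setting could be overridden in a flyte-binary deployment, assuming the chart's configuration.inline block is merged into the propeller configuration. The value shown is the documented default; raising it only hides the underlying failure, so it is best changed after the cause is understood.

```yaml
# values.yaml for the flyte-binary Helm chart (sketch)
configuration:
  inline:
    propeller:
      max-workflow-retries: 10   # documented default
```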
Adding more resources to the Pod can help, but the next question would be: how much to add?
Let us know if that helps
kumare
please give it some memory
kumare
also this explains why it's dying
kumare
you did not configure it from the helm?
guyarad
oddly, the flyte-binary pod has no resource requests/limits configured - how's that?
guyarad
I didn't see anything specific with the resources: memory was high, but CPU was low
kumare
Also check resources etc
kumare
Are you using secrets, if not you can disable the webhook
guyarad
Hi all, we have a self-hosted deployment of flyte-binary (on Oracle Cloud) and we started getting this error (only sometimes):
Workflow[flyte-tasks:production:some-workflow] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: worker error(s) encountered: [0]: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": EOF
It's possible it happens mostly on higher loads.
I have a few questions:
propeller.disableWebhook configuration, or maybe delete the relevant webhook resource?)
Thanks!