Summary
The user is facing memory-usage problems with FlytePropeller in their Flyte 1.9 deployment on AWS EKS: the propeller container's memory usage grows over 24 hours, leading to out-of-memory errors and restarts. They suspect the issue is more complex than simply needing to increase memory limits and are looking for insight into which propeller features might be causing the memory growth. The user has set the propeller ReplicaSet to 2, but both pods experience OOM kills and CrashLoopBackOffs. They plan to increase memory limits next and speculate that the pods may be crashing during an initial fetch of past executions. Additionally, they have updated the data plane from version 1.9 to 1.12.
david.espejo
definitely it's not expected behavior
broder.peters
<@U04H6UUE78B> but based on what you've seen so far, this is not an expected behavior?
broder.peters
Nothing special in flyteadmin I would say. We currently don't have Grafana set up and are only using Datadog. I will have to check that out a bit later unfortunately. I will also double-check the bug later.
david.espejo
There are <https://docs.flyte.org/en/latest/deployment/configuration/generated/flytepropeller_config.html#section-event|some settings> available for the EventSink but trying to refrain from changing things arbitrarily
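[Editor's note: for reference, the event-related (EventSink) block of the propeller config that the linked settings page describes looks roughly like the sketch below. The key names and default values are taken from the config reference as best I can tell; verify them against the configmap actually deployed with your chart version before changing anything.]

```yaml
# Sketch of the propeller `event` (EventSink) section -- keys and
# defaults assumed from the config reference, not copied from this
# deployment's configmap.
event:
  type: admin     # send node/task events to flyteadmin
  rate: 500       # token-bucket refill rate for outgoing events
  capacity: 1000  # token-bucket burst capacity
```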
david.espejo
<@U04NCU28PD0> what happens with flyteadmin when propeller crashes? Does it work OK? From the error it looks like at some point propeller isn't able to post events to flyteadmin via the EventSink. Using the Grafana dashboard one could observe whether there's any pattern that leads to the OOM (<https://github.com/flyteorg/flyte/issues/5606|this bug>, for example, was isolated using that dashboard)
broder.peters
Another small addition: we are running it now with a ReplicaSet of 2 and 400Mi memory limits, and the pod where the memory goes down rapidly ran into OOM.
broder.peters
Interesting :thinking_face: I just set the propeller ReplicaSet to 2, and with the same memory both spawned pods run into OOM and CrashLoopBackOffs. Obviously I will go with a memory increase next. On second thought: I guess they just have to do an initial fetch of past executions or something and therefore both crash. (Sidenote: with that I also updated the data plane from 1.9 to 1.12)
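[Editor's note: in the flyte-core chart, raising the propeller replica count and memory limit would look something like the sketch below. The value paths (`flytepropeller.replicaCount`, `flytepropeller.resources`) are assumed from the chart's `values.yaml`; the specific numbers are illustrative, not a recommendation.]

```yaml
# flyte-core values override -- paths assumed from the chart's
# defaults; check them against the values.yaml of your chart version.
flytepropeller:
  replicaCount: 2
  resources:
    limits:
      memory: 1Gi    # raised from 400Mi to give headroom while debugging
    requests:
      memory: 512Mi
```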
broder.peters
Describe of the propeller pod; the main hint is the Last State bit with OOMKilled that I'm starting from
broder.peters
Those should be all the logs, grouped into the pattern of one container
broder.peters
With all the internal stuff plainly filtered out, this one looks pretty boring tbh. Let me check if I can get more insights that I can share, also around the container.
broder.peters
Workflow with plain tasks only, resulting in 21 nodes. Highest resource limits are 4 CPUs and 4Gi for one task. No cache. For the logs I need a bit of time to filter out some internal stuff :sweat_smile:
david.espejo
<@U04NCU28PD0> what type of workflow is running on that data plane? (I mean, map tasks, dynamic, etc) Could you get logs from the propeller pod?
broder.peters
Hello, I'm trying to get a better understanding of FlytePropeller. (I've read a bit of <https://docs.flyte.org/en/latest/user_guide/concepts/component_architecture/flytepropeller_architecture.html|this doc> already, but not all the details yet.) I would like to better understand the following case: we have Flyte 1.9 (yes, updating soon) deployed in AWS EKS with the flyte-core Helm charts, with control and data plane in separate clusters. On this particular data plane a simple workflow is executed every 15 minutes. What we've noticed is that the propeller container of a fresh cluster starts growing in memory usage over the first 24 hours and then, in the long run, starts running into OOM and restarts. Sometimes it's also not able to recover on its own and continuously runs immediately into OOM. I'm aware that I could just increase the memory limits to something like 500MB, but I feel like the growth would just continue, as I'm missing another crucial part here. Any hints as to which propeller feature might cause this, that I should look into more? (In the second image, the dashed line shows the restarts of the container, with the axis on the right)