
Flyte Propeller Memory Usage Issues

Summary

The user is facing memory usage problems with FlytePropeller in a Flyte 1.9 deployment on AWS EKS: the propeller container's memory usage grows over the first 24 hours, leading to out-of-memory (OOM) kills and restarts. They suspect that simply raising the memory limits would not address the root cause and are looking for insights into which propeller features might be causing the growth. After setting the propeller replica count to 2, both pods ran into OOM and CrashLoopBackOff. They plan to increase memory limits next and speculate that the pods may be crashing because of an initial fetch of past executions. They also updated the data plane from version 1.9 to 1.12.

Status: resolved
Source: #flyte-deployment

      david.espejo

      10/22/2024

      definitely it's not expected behavior


      broder.peters

      10/18/2024

      @david.espejo but based on what you've seen so far, this is not expected behavior?


      broder.peters

      10/18/2024

      Nothing special in flyteadmin, I would say. We currently don't have Grafana set up and are only using Datadog. I will have to check that out a bit later, unfortunately. I will also double-check the bug later.


      david.espejo

      10/17/2024

      There are some settings available for the EventSink (https://docs.flyte.org/en/latest/deployment/configuration/generated/flytepropeller_config.html#section-event), but I'm trying to refrain from changing things arbitrarily


      david.espejo

      10/17/2024

      @broder.peters what happens with flyteadmin when propeller crashes? Does it work OK? From the error it looks like at some point propeller isn't able to post events to flyteadmin via the EventSink. Using the Grafana dashboard, one could observe if there's any pattern that leads to the OOM (for example, this bug was isolated using that dashboard: https://github.com/flyteorg/flyte/issues/5606)


      broder.peters

      10/17/2024

      Another small addition: We are running it now with a replicaset of 2 and 400Mi memory limits and the pod where the memory goes down rapidly ran into OOM.


      broder.peters

      10/16/2024

      Interesting :thinking_face: I just set the replicaset of propeller to 2, and with the same memory both spawned pods run into OOM and CrashLoopBackOff. Obviously I will go with a memory increase next. On second thought: I guess they just have to do an initial fetch of past executions or something, and therefore both crash. (Side note: with that I also updated the data plane from 1.9 to 1.12)


      broder.peters

      10/16/2024

      Describe output of the propeller pod. The main hint is the Last State bit with OOMKilled, which is what I'm basing this on
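The Last State check mentioned above can also be scripted. This is a minimal sketch, assuming the pod object has been exported as JSON (for example with `kubectl get pod <pod-name> -n <namespace> -o json`). The helper name `oom_killed_containers` is hypothetical. The fields it reads (`status.containerStatuses[].lastState.terminated.reason`) are the standard Kubernetes pod status fields.

```python
from typing import Any, Dict, List


def oom_killed_containers(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return containers of a pod whose last termination reason was OOMKilled.

    `pod` is a pod object parsed from JSON, e.g. the output of
    `kubectl get pod <pod-name> -o json`. The function name is hypothetical.
    """
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is only present if the container has
        # terminated at least once before its current state.
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            hits.append(
                {
                    "name": cs.get("name"),
                    "restartCount": cs.get("restartCount", 0),
                    "finishedAt": terminated.get("finishedAt"),
                }
            )
    return hits
```

To use it, parse the exported JSON with `json.load` and pass the resulting dict. An empty list means no container in that pod was last terminated by the OOM killer.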


      broder.peters

      10/16/2024

      Those should be all the logs of one container, grouped by pattern


      broder.peters

      10/16/2024

      With all the internal stuff plainly filtered out, this one looks pretty boring tbh. Let me check if I can get more insights that I can share, also around the container


      broder.peters

      10/16/2024

      Workflow with plain tasks only, resulting in 21 nodes. Highest resource limits are 4 CPUs and 4 Gi for one task. No cache. For the logs I need a bit of time to filter out some internal stuff :sweat_smile:


      david.espejo

      10/16/2024

      @broder.peters what type of workflow is running on that data plane? (I mean map tasks, dynamic, etc.) Could you get logs from the propeller pod?


      broder.peters

      10/15/2024

      Hello, I'm trying to get a better understanding of the Flyte propeller. (I've read a bit of this doc already, but not in all details yet: https://docs.flyte.org/en/latest/user_guide/concepts/component_architecture/flytepropeller_architecture.html) I would like to better understand the following case: We have Flyte 1.9 (yes, updating soon) deployed in AWS EKS with the flyte-core Helm charts, with control plane and data plane in separate clusters. On this particular data plane a simple workflow is executed every 15 minutes. What we've noticed is that the propeller container of a fresh cluster starts growing in memory usage over the first 24 hours and then starts to run into OOM and restarts in the long run. Sometimes it's also not able to recover on its own and continuously runs into OOM immediately. I'm aware that I could just increase the memory limits to something like 500MB, but I feel like the growth would just continue, as I'm missing another crucial part here. Any hints on which feature of the propeller might cause this and that I should look into more? (In the second image, the dashed line shows the restarts of the container, with its axis on the right)
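One way to test the worry that a higher limit only postpones the OOM is to fit a straight line to exported memory samples and extrapolate to the limit. The sketch below assumes samples are available as `(hours, MiB)` pairs, for example exported from a monitoring tool. The function names are hypothetical, and a linear fit is an assumption rather than a statement about how propeller actually uses memory.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (hours since start, memory usage in MiB)


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope in MiB/hour, intercept in MiB)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    slope = cov / var_t
    return slope, mean_m - slope * mean_t


def hours_until_limit(samples: Sequence[Sample], limit_mib: float) -> Optional[float]:
    """Hours after the last sample until the fitted line reaches limit_mib.

    Returns None if the fitted trend is flat or decreasing.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    return max(0.0, (limit_mib - intercept) / slope - last_t)
```

If the estimate simply scales with the limit, the growth is unbounded over the observed window and raising the limit only delays the restart. If usage flattens out in later samples, the slope shrinks and a moderately higher limit may be enough.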