
Flyte enables you to build & deploy data & ML pipelines, hassle-free. The infinitely scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks. Explore and Join the Flyte Community!

Memory Usage Increase in Flyte Binary Pod

Summary

The user reports that memory usage of the flyte-binary pod grows from 500 MB to 2 GB under heavy load and ends in an OOM kill. They are unsure whether the cause is internal caching, dataCatalog, or all workflows being held in memory, and they suspect either a caching issue or a memory leak. They request a feature to delete completed workflows from memory and suggest increasing resources to 4 CPUs and more than 4 GB of RAM, while questioning whether 4 GB is sufficient for about 10,000 daily pipeline triggers. They plan to investigate how cache control affects memory scaling and emphasize the importance of monitoring, mentioning specific configuration parameters and performance optimization resources.

In later updates, the user notes that memory can rise to about 8 GB before an OOM kill occurs, although no workloads fail. Under production workload the pod no longer crashes every few hours, but it still crashes about every 2 days with memory around 8 GB. Given the expected volume, they are advised to consider moving to flyte-core, which offers additional mechanisms to scale out. The user has a temporary solution in place and plans to build a POC of flyte-core. Finally, Flyte 1.13.2 is reported to contain a fix for the memory issue in flyte-binary, and a participant proposes planning an update on Monday.
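The figures in the summary allow a rough back-of-the-envelope check on the question of whether 4 GB is enough. The sketch below is only illustrative: it assumes memory grows linearly between restarts, takes the roughly 500 MB baseline and roughly 8 GB OOM level reported in the thread as the two sample points, and treats "every 2 days" as 48 hours. The function names are hypothetical and are not part of Flyte.

```python
def growth_rate_mb_per_hour(start_mb, end_mb, hours):
    """Average memory growth between two samples, in MB per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (end_mb - start_mb) / hours


def hours_until_limit(current_mb, limit_mb, rate_mb_per_hour):
    """Hours until the limit is reached under linear growth (assumption)."""
    if rate_mb_per_hour <= 0:
        return float("inf")  # memory is flat or shrinking, limit never reached
    return max(limit_mb - current_mb, 0) / rate_mb_per_hour


# Assumed inputs taken from the thread: ~500 MB baseline, ~8 GB at OOM kill,
# one crash roughly every 2 days (48 hours).
rate = growth_rate_mb_per_hour(500, 8 * 1024, 48)
hours_at_4gb = hours_until_limit(500, 4 * 1024, rate)

print(f"growth: {rate:.2f} MB/hour")
print(f"time to reach a 4 GB limit: {hours_at_4gb:.1f} hours")
```

Under these assumptions, raising the limit only changes how long the pod runs before the next OOM kill. If the growth comes from a leak, the 1.13.2 fix mentioned in the thread addresses the cause rather than the interval.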

Status
resolved
Tags
    Source
    #ask-the-community

      divyank.agarwal

      10/5/2024

      Oh, my bad. I thought the fix got released in 1.13.1.

      dubovikov.kirill

      10/4/2024

      <@U06MQ3WEUBS> could we plan an update on Monday?

      curupa

      10/4/2024

      <@U06MQ3WEUBS>, Flyte 1.13.2 is out (https://github.com/flyteorg/flyte/releases/tag/v1.13.2), and it contains a fix for the memory issue you're seeing in flyte-binary.

      divyank.agarwal

      9/30/2024

      OK, thanks. We have a temporary solution, and we will start building a POC of flyte-core.

      david.espejo

      9/30/2024

      With the number of executions you have and what you expect (around 10k triggers a day), I'd say you should consider moving to flyte-core, as there are additional mechanisms available to scale out.

      divyank.agarwal

      9/30/2024

      Let me know if any other metric is needed.

      divyank.agarwal

      9/30/2024

      I can get that.

      david.espejo

      9/30/2024

      <@U06MQ3WEUBS> do you have metrics on resource usage from the Pod? How many executions are you running?

      divyank.agarwal

      9/30/2024

      <@U0265RTUJ5B> I added an observation above. It might be useful.

      divyank.agarwal

      9/30/2024

      Memory here is around 8 GB.

      divyank.agarwal

      9/30/2024

      An update on this problem: under our production workload, it is not crashing every few hours. However, it is still crashing every 2 days.