Out of Memory Errors with Flyte Pods

Summary

The user's pods were being killed with Out of Memory (OOM) errors. They tried to increase the resource limits, but the settings reverted to defaults. They provided a Flyte example in which the task and workflow definitions specified resource requests and limits, yet the pod limits still showed default values. The user's task resource configuration set only defaults, 500m CPU and 10Gi memory. A suggestion was made to upgrade the FlytePropeller image, since a fix had been released earlier. The user resolved the issue by specifying both defaults and limits in the task resource configuration, noting that limits might not be necessary since Flyte should default to requests=limits, which is better for the K8s scheduler. In practice, Flyte did not respect this setting and reverted to defaults, indicating that the limits in the flyteadmin settings needed to be adjusted. They also encountered an error stating that the requested CPU limit exceeded the limit set in the platform configuration. The user believes better documentation is needed, as the differences between defaults, requests, and limits are not well explained.

Status
open
Tags
    Source
    #ask-the-community
      eric901201

      9/18/2024

      This is how it works, as far as I remember.

      eric901201

      9/18/2024

      You have to set the CPU limit in the admin's config to at least the CPU limit your task requests.

      rmalla

      9/17/2024

      It simply throws this error: Details: Requested CPU limit [2] is greater than current limit set in the platform configuration [500m]. Please contact Flyte Admins to change these limits or consult the configuration
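
The error above compares two Kubernetes-style CPU quantities: `2` means two full cores, while `500m` means 500 millicores, or half a core. A minimal sketch of that comparison follows. `parse_cpu` and `check_cpu_limit` are hypothetical helpers written for illustration, not Flyte's actual validation code, and the error text is only modeled on the message quoted above.

```python
from decimal import Decimal


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "2" or "500m" to a number of cores."""
    if quantity.endswith("m"):
        # The "m" suffix means millicores: 1000m == 1 core.
        return Decimal(quantity[:-1]) / Decimal(1000)
    return Decimal(quantity)


def check_cpu_limit(requested: str, platform_limit: str) -> None:
    """Raise ValueError if the requested limit exceeds the platform limit."""
    if parse_cpu(requested) > parse_cpu(platform_limit):
        raise ValueError(
            f"Requested CPU limit [{requested}] is greater than current limit "
            f"set in the platform configuration [{platform_limit}]."
        )


# A request equal to the platform limit is accepted.
check_cpu_limit("500m", "500m")

# The toy task's cpu="2" against a platform limit of 500m is rejected.
try:
    check_cpu_limit("2", "500m")
except ValueError as err:
    print(err)
```

Under this reading, the task either has to ask for no more than 500m of CPU, or the platform-side CPU limit has to be raised to at least 2.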

      rmalla

      9/17/2024

      David, I tried that, but Flyte is not respecting requests=limits and is reverting to the defaults. Without the override, it says the limits in flyteadmin need to be adjusted.

      rmalla

      9/17/2024

      Thanks for your help.

      rmalla

      9/17/2024

      Hi Han-Ru, I believe I solved it. The issue was that in the resource config, we need to specify both the defaults and the limits; otherwise they are ignored. Like so:

        task_resources:
          defaults:
            cpu: 500m
            memory: 10Gi
          limits:
            cpu: 500m
            memory: 100Gi

      eric901201

      9/17/2024

      after you set your config, did you restart your propeller?

      eric901201

      9/17/2024

      I think you have to update your propeller's deployment to the latest

      eric901201

      9/17/2024

      cc <@U04H6UUE78B>, can you help him use the latest flytepropeller image? I haven't had experience with the "Hard Way".

      rmalla

      9/17/2024

      Han-Ru, I am using flyte-binary, and I have installed the latest version, using Helm Chart, as specified in the “Hard Way”.

      rmalla

      9/17/2024

      Oh. let me check. Thanks

      eric901201

      9/17/2024

      upgrade your propeller's image to the latest version.

      eric901201

      9/17/2024

      It was fixed 4 months ago, I think.

      eric901201

      9/17/2024

      did you use the latest propeller?

      rmalla

      9/17/2024

      Hi there, my pods are getting killed with OOM. I tried increasing the limits, but it still defaults to the presets. I am trying this toy example:

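      ```
      from flytekit import Resources, task, workflow


      # Task-level resources: requests and limits are set to the same values.
      @task(
          requests=Resources(
              cpu="2",
              mem="0.5Gi",
          ),
          limits=Resources(
              cpu="2",
              mem="0.5Gi",
          ),
      )
      def foo():
          print('task')


      @workflow
      def my_wf():
          # First call uses the task-level resources.
          foo()
          # Second call overrides them for this node only.
          foo().with_overrides(
              requests=Resources(
                  cpu="1",
                  mem="2Gi",
              ),
              limits=Resources(
                  cpu="1",
                  mem="4Gi",
              ),
          )
      ```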
      Here is the output of limits on the pod:
      ```
      NAMESPACE                 POD                                             CONTAINER                      MEM_REQ   MEM_LIM   CPU_REQ   CPU_LIM
      flyte                     flyte-backend-flyte-binary-548f5d59fc-ln6q8     flyte                          <none>    <none>    <none>    <none>
      flytesnacks-development   azpd24c4qpc2w2jlqvhz-n0-0                       azpd24c4qpc2w2jlqvhz-n0-0      512Mi     512Mi     2         2
      flytesnacks-development   azpd24c4qpc2w2jlqvhz-n1-0                       azpd24c4qpc2w2jlqvhz-n1-0      1Gi       1Gi       1         1
      ```
      Here is the task resource config:
      ```
      task_resources:
        defaults:
          cpu: 500m
          memory: 10Gi
      ```
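
For readers comparing the toy example with the pod listing: the first pod reports 512Mi for the task that asked for `mem="0.5Gi"`, and these are the same quantity written with different suffixes. The sketch below shows the conversion. `parse_memory` and `to_mebibytes` are hypothetical helpers for illustration only; they handle just the binary suffixes (Ki, Mi, Gi, Ti) and are not part of flytekit or Kubernetes.

```python
from decimal import Decimal

# Binary (power-of-two) suffixes used in Kubernetes-style memory quantities.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "0.5Gi" or "512Mi" to a number of bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    # No suffix: treat the value as plain bytes.
    return int(Decimal(quantity))


def to_mebibytes(quantity: str) -> str:
    """Express a memory quantity in whole mebibytes, e.g. "0.5Gi" -> "512Mi"."""
    return f"{parse_memory(quantity) // 2**20}Mi"


print(to_mebibytes("0.5Gi"))  # 512Mi
print(to_mebibytes("10Gi"))   # 10240Mi
```

The same helper makes it easy to see how far apart the configured 10Gi default and the 0.5Gi task request are when both are expressed in one unit.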