

Flyte Task Failures Diagnosis

Summary

The user saw Flyte tasks fail after 10 retries with an object modification error. This error can abort a workflow, although it is often recoverable. They were running Flyte on EKS with Fargate profiles, using a single flyte-binary deployment scaled to 4 replicas. Suspecting that the cause lay either in how Flyte manages nodes and pods or in how Fargate operates, they asked for help diagnosing it. They were advised to set the max-streak-length parameter and to review the documentation.

The user then scaled flyte-binary down to 1 replica, added the max-streak-length parameter, and let the vertical pod autoscaler do its work. They believe the failures came from multiple flyte-binary replicas racing against each other, and they now understand that memory bottlenecks should be addressed by scaling vertically rather than horizontally. They are seeking confirmation that this understanding is correct.
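The suspected race can be illustrated with a toy model. This is only a sketch of optimistic concurrency in general, the scheme Kubernetes uses when it rejects an update based on a stale resource version. It is not Flyte's code, and the thread does not confirm that this is the exact mechanism behind the failures. The names VersionedStore, ConflictError, and run_round are hypothetical.

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale version of the object."""


class VersionedStore:
    """Toy stand-in for an API server that stores one versioned object."""

    def __init__(self):
        self.version = 0
        self.state = "queued"

    def read(self):
        # Return the current version together with the state.
        return self.version, self.state

    def update(self, based_on_version, new_state):
        # Reject the write if the object changed since the caller read it.
        if based_on_version != self.version:
            raise ConflictError("the object has been modified")
        self.state = new_state
        self.version += 1


def run_round(store, replica_count):
    """Each replica reads the same snapshot, then all of them try to write."""
    snapshots = [store.read() for _ in range(replica_count)]
    conflicts = 0
    for version, _ in snapshots:
        try:
            store.update(version, "running")
        except ConflictError:
            conflicts += 1
    return conflicts


# Hypothetical comparison: 4 replicas (the original setup) vs. 1 replica.
conflicts_with_four = run_round(VersionedStore(), 4)
conflicts_with_one = run_round(VersionedStore(), 1)
print(conflicts_with_four, conflicts_with_one)
```

In this model only the first writer succeeds in each round, so every additional replica adds a rejected update, while a single replica never conflicts with itself. That matches the user's reasoning for scaling down to 1 replica and sizing that one pod vertically.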

Status
resolved
Tags
    Source
    #flyte-deployment