F

Flyte enables you to build & deploy data & ML pipelines, hassle-free. The infinitely scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks. Explore and Join the Flyte Community!

Intermittent errors in Flyte cluster

Summary

The user is facing intermittent errors with their Flyte cluster (v1.12.0) when starting tasks, especially during high workload periods with multiple pods. These errors occur every 1-2 days, disrupting workflows and requiring manual intervention, as retries are ineffective. The user has increased grace period settings in config.K8sPluginConfig but is looking for further configuration adjustments or solutions. They discuss the cluster size and pod creation rate, referencing a relevant PR that indicates the issue may stem from the container creation process taking too long. They mention that increasing the grace period to 10 minutes has reduced the frequency of errors. The conversation also touches on potential GCP-related issues and suggests escalating the matter to a GCP representative.

Status
resolved
Tags
    Source
    #ask-the-community