Summary
The user is facing intermittent errors with their Flyte cluster (v1.12.0
) when starting tasks, especially during high workload periods with multiple pods. These errors occur every 1-2 days, disrupting workflows and requiring manual intervention, as retries are ineffective. The user has increased grace period settings in config.K8sPluginConfig
but is looking for further configuration adjustments or solutions. They discuss the cluster size and pod creation rate, referencing a relevant PR that indicates the issue may stem from the container creation process taking too long. They mention that increasing the grace period to 10 minutes has reduced the frequency of errors. The conversation also touches on potential GCP-related issues and suggests escalating the matter to a GCP representative.