

Testing ArrayNode map_task Failures

Summary

The user is testing the ArrayNode map_task in version 1.13 with large jobs of over 60,000 tasks and is seeing failures on node dn09, mostly in the 3rd map_task. They have logs only for the 1st map_task. Their earlier tests used around 4,000 subtasks, and they are checking whether the new tests actually reach 60,000 subtasks.

They mention a task limit of 5,000, which, because of etcd management issues, meant running 12 copies to reach 60,000 tasks. In their layout, each ArrayNode should not exceed 400 subtasks, each dynamic workflow holds around 1,200, and each Flyte execution targets about 60,000 for stress testing. After upgrading flytekit and FlytePropeller to 1.13, they are evaluating whether the new version scales well enough to consolidate these workloads.

After increasing the resource requests on the dynamic workflows, they completed two runs of about 60,000 tasks, but a larger run of 150,000 tasks failed. They are investigating whether there is an upper limit on the total number of nodes in an execution or in a single dynamic workflow. The layout with issues has 50 dynamic workflows, each with 3 map_tasks and 2 tasks that materialize results, for about 250 nodes in total. They doubt that the error "Last Error: UNKNOWN::Outputs not generated by task execution" is caused by etcd limits and suspect instead that the storage client could not locate the output file.
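The sizing described above can be checked with some simple arithmetic. The following is a minimal Python sketch, assuming the quoted numbers (at most 400 subtasks per ArrayNode, 3 map_tasks plus 2 materialization tasks per dynamic workflow, and a 5,000-task limit per copy) are the user's own design targets rather than limits enforced by Flyte. The helpers `plan_fan_out`, `chunk_inputs`, and `copies_needed` are hypothetical illustrations, not flytekit APIs.

```python
import math
from dataclasses import dataclass

# Numbers quoted in the thread; assumed to be design targets, not Flyte-enforced limits.
MAX_SUBTASKS_PER_ARRAYNODE = 400
MAP_TASKS_PER_DYNAMIC = 3
MATERIALIZE_TASKS_PER_DYNAMIC = 2
TASK_LIMIT_PER_COPY = 5_000  # hypothetical reading of the "task limit of 5,000"


@dataclass
class FanOutPlan:
    total_subtasks: int
    max_subtasks_per_dynamic: int
    dynamic_workflows: int
    workflow_nodes: int  # map_task + materialize nodes across all dynamic workflows


def plan_fan_out(total_subtasks: int) -> FanOutPlan:
    """Estimate how many dynamic workflows and nodes a job needs under the quoted limits."""
    per_dynamic = MAX_SUBTASKS_PER_ARRAYNODE * MAP_TASKS_PER_DYNAMIC  # 400 * 3 = 1,200
    n_dynamic = math.ceil(total_subtasks / per_dynamic)
    nodes = n_dynamic * (MAP_TASKS_PER_DYNAMIC + MATERIALIZE_TASKS_PER_DYNAMIC)
    return FanOutPlan(total_subtasks, per_dynamic, n_dynamic, nodes)


def chunk_inputs(items, chunk_size=MAX_SUBTASKS_PER_ARRAYNODE):
    """Split a flat list of inputs into batches no larger than one ArrayNode."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def copies_needed(total_tasks: int, per_copy_limit: int = TASK_LIMIT_PER_COPY) -> int:
    """Number of copies needed when each copy may run at most per_copy_limit tasks."""
    return math.ceil(total_tasks / per_copy_limit)


if __name__ == "__main__":
    for total in (60_000, 150_000):
        print(plan_fan_out(total))
    print([len(c) for c in chunk_inputs(list(range(1_000)))])
    print(copies_needed(60_000))
```

Under these assumptions, 60,000 subtasks work out to 50 dynamic workflows and about 250 workflow nodes, which matches the layout in the thread, and 12 copies of 5,000 tasks each. If the 150,000-task run kept the same layout, it would need 125 dynamic workflows and about 625 nodes, which is the kind of node count the user is asking about when they look for an upper limit per execution.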

Status
resolved
Tags
    Source
    #ask-the-community