Summary
The user is asking how errors propagate in Flyte's 2-node PyTorch elastic training setup, focusing on what happens when both pods fail. They argue that the first reported failure should take priority, and describe how exceptions such as ChildFailedError and RendezvousClosedError currently lead to inconsistent error reporting. They doubt that the current PyTorch plugin can manage this on its own, since reliably returning the first error is complex and would otherwise require a consensus mechanism between workers. They also note a race condition in which the task pods all overwrite the same error file. To address this, they propose modifying Flyte's entrypoint so that each worker writes its error under a file name derived from its group rank, allowing the error with the earliest timestamp to be selected, and they offer to collaborate on a solution for managing multiple error files across plugins. Concretely, they suggest a function that returns a file name based on the group rank environment variable, and a new interface, MultiErrorFileRemoteFileOutputReader, that reads the error files and selects the one with the earliest timestamp.
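A minimal sketch of the writer-side idea, assuming the entrypoint can be told which error file name to use; the error-{group_rank}.pb naming scheme and the reliance on torchelastic's GROUP_RANK environment variable are illustrative assumptions, not the current Flyte API:

```python
import os


def error_file_name(default: str = "error.pb") -> str:
    """Derive a per-worker error file name from the group rank.

    Assumes (hypothetically) that torchelastic's GROUP_RANK variable is
    visible to the entrypoint; when it is unset (e.g. for non-elastic
    tasks) the single default name is returned, preserving existing
    behaviour.
    """
    group_rank = os.environ.get("GROUP_RANK")
    if group_rank is None:
        return default
    # One error file per node group, e.g. error-0.pb, error-1.pb, so
    # concurrently failing pods no longer overwrite each other's error.
    stem, ext = os.path.splitext(default)
    return f"{stem}-{group_rank}{ext}"
```

With distinct file names per group rank, the downstream reader can compare all error documents instead of seeing only whichever pod wrote last.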
The user acknowledges the complexity of the change and outlines an implementation plan: setting the required environment variables, adding entrypoint support for multi-error-file uploads, handling exception timestamps, and adding configuration support for the new reader interface.
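A sketch of how a MultiErrorFileRemoteFileOutputReader-style selection could work, assuming each uploaded error document records the timestamp at which its exception was raised; the JSON payload and the error- file prefix are assumptions for illustration, since the real reader would operate on Flyte's protobuf error documents in blob storage:

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def earliest_error(error_dir: str, prefix: str = "error-") -> Optional[Tuple[float, str]]:
    """Return (timestamp, message) of the first failure across all ranks.

    Each per-rank error file is assumed (for illustration) to be a small
    JSON document with 'timestamp' (unix seconds) and 'message' fields.
    """
    earliest: Optional[Tuple[float, str]] = None
    for path in sorted(Path(error_dir).glob(f"{prefix}*")):
        with path.open() as f:
            doc = json.load(f)
        candidate = (doc["timestamp"], doc["message"])
        # Keep the error raised earliest across all group ranks.
        if earliest is None or candidate[0] < earliest[0]:
            earliest = candidate
    return earliest
```

Selecting by earliest timestamp makes it more likely that the surfaced error is the root cause (the worker that failed first) rather than a follow-on RendezvousClosedError raised by the surviving pods.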