Summary
The user asks whether anyone runs multi-node distributed training on Flyte in a fault-tolerant way, specifically by combining intratask checkpoints with elastic torch. They mention conversations with scientists doing large-scale distributed training whose main ask is to spin up spot instances and keep training going even when some GPUs fail. In response, kumare notes that folks from LinkedIn run distributed training with custom checkpointing logic and have shown some interest in collaborating, and that PyTorch now has better native support for distributed checkpointing.
kumare
<@U042Z2S8268> and folks from LinkedIn run distributed training, but have custom checkpointing logic. I have talked to them about potentially collaborating on this, and there was some interest. Byron, can you remind me who it was from your team?
It also seems PyTorch now has better native support for checkpointing: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html. We can store the state dict and the dataloader state.
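As a rough illustration of what the linked recipe enables (not code from this thread), here is a minimal save/restore sketch using torch.distributed.checkpoint. The paths and the plain-DDP assumption are ours; FSDP setups would additionally use the get_state_dict helpers from the recipe.

```python
# Minimal sketch based on the linked DCP recipe. Assumes torch >= 2.2, where
# torch.distributed.checkpoint exposes dcp.save / dcp.load with checkpoint_id.
import torch
import torch.distributed.checkpoint as dcp
import torch.nn as nn

CHECKPOINT_DIR = "/tmp/dcp_checkpoint"  # hypothetical path; in practice a mounted object store


def save(model: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    # Each rank writes its own shard of the (possibly sharded) state dict.
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    dcp.save(state_dict, checkpoint_id=CHECKPOINT_DIR)


def restore(model: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    # dcp.load reads the shards back into the provided state dict in place,
    # then we push it into the module and optimizer.
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    dcp.load(state_dict, checkpoint_id=CHECKPOINT_DIR)
    model.load_state_dict(state_dict["model"])
    optimizer.load_state_dict(state_dict["optim"])
```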
thomas316
Are there users that run multi-node distributed training on Flyte that is fault-tolerant (by using intratask checkpoints with elastic torch)?
I spoke to some scientists who do large-scale distributed training, and all they want is a way to spin up spot instances and have training keep going even when some GPUs go down.
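For reference, here is a hedged sketch of what that could look like on Flyte, combining the Elastic task config from flytekitplugins-kfpytorch with the intratask checkpoint API; the model/epoch handling, paths, node counts, and retry settings are illustrative assumptions, not a confirmed user setup.

```python
# Sketch only: a Flyte task that runs elastic torch across nodes and uses
# intratask checkpoints so a retry after a spot preemption resumes training.
import os
import torch
from flytekit import task, current_context
from flytekitplugins.kfpytorch import Elastic

NUM_EPOCHS = 10  # hypothetical


@task(
    task_config=Elastic(nnodes=2, nproc_per_node=4),  # multi-node elastic torch
    retries=3,            # rerun the task after a node/GPU failure
    interruptible=True,   # allow scheduling on spot instances
)
def train() -> None:
    cp = current_context().checkpoint
    start_epoch = 0

    # On a retry (e.g. after a spot preemption), pull back the last checkpoint
    # that Flyte persisted to blob storage.
    prev_dir = cp.restore("/tmp/restore")
    if prev_dir:
        state = torch.load(os.path.join(str(prev_dir), "model.pt"))
        start_epoch = state["epoch"] + 1
        # model.load_state_dict(state["model"])  # hypothetical model object

    for epoch in range(start_epoch, NUM_EPOCHS):
        # train_one_epoch(...)  # hypothetical training step
        torch.save({"epoch": epoch}, "/tmp/model.pt")
        # In practice you would likely guard this to rank 0 only.
        cp.save("/tmp/model.pt")  # Flyte uploads this to durable storage
```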