Optimizing S3/GCS File Handling in Flyte

Summary

The user is optimizing file and directory handling on S3/GCS for a Flyte workflow that processes two text files from a remote directory. Retrieval and processing currently run as separate tasks, and because each task runs in its own temporary Kubernetes pod, pod creation adds overhead. To reduce it, they are considering consolidating the tasks into a single step that works in a local temporary directory structure and outputs a FlyteDirectory at the end.

They report that code snippets which work in the sandbox fail in their single-cluster deployment, especially when listing the files in a FlyteDirectory. They want to avoid relying solely on the GS client to read from the bucket. They also discuss the difficulty of defining a workflow input as a FlyteDirectory when the files exist only in a bucket, and they are unsure how flytedir.download() behaves when there is no local copy of the files.

They are looking for tips to speed up the workflow, including parallelizing file processing and reusing pods to reduce overhead. They note that pod spin-up time significantly impacts total processing time, especially with small files, and believe that refactoring the tasks could save several seconds, depending on the cluster resources allocated.
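The consolidation idea from the thread can be sketched roughly as follows. This is a minimal illustration, not the user's actual code: `process_file`, `process_dir`, `fetch_and_process`, the upper-casing step, and the `.txt` filter are all hypothetical stand-ins. The flytekit wrapper is guarded by a `try`/`except ImportError` so the plain-Python helpers can be tried locally without flytekit installed. It assumes `FlyteDirectory.download()` returns the local path of the downloaded copy and that the task's working directory is writable.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def process_file(src: Path, dst_dir: Path) -> Path:
    """Hypothetical per-file step: upper-case the text and write it to dst_dir."""
    dst = dst_dir / src.name
    dst.write_text(src.read_text().upper())
    return dst


def process_dir(local_in: Path, local_out: Path, max_workers: int = 4) -> List[Path]:
    """Process every .txt file in local_in concurrently, writing results to local_out."""
    local_out.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in local_in.iterdir() if p.suffix == ".txt")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `files`.
        return list(pool.map(lambda p: process_file(p, local_out), files))


try:
    from flytekit import current_context, task
    from flytekit.types.directory import FlyteDirectory
except ImportError:
    # flytekit is optional here so the helpers above can be exercised locally.
    pass
else:

    @task
    def fetch_and_process(in_dir: FlyteDirectory) -> FlyteDirectory:
        # One pod does both retrieval and processing.
        # download() copies the remote contents locally and returns the local path.
        local_in = Path(in_dir.download())
        local_out = Path(current_context().working_directory) / "processed"
        process_dir(local_in, local_out)
        # Returning a FlyteDirectory lets Flyte upload the results when the task ends.
        return FlyteDirectory(path=str(local_out))
```

Threads are used here on the assumption that the per-file work is mostly I/O; for CPU-heavy processing a `ProcessPoolExecutor` would be the more likely choice. With only two small files the gain from parallelism is probably minor compared with the saving from running one pod instead of two.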

Status
resolved
Tags
Source
#ask-the-community