
Parallel Processing with Flyte

Summary

The user is asking how to implement parallel processing in Flyte for a large dataset (roughly 150 GB) that they currently read through a generator yielding rows. They want to process the data in batches, handing off one million rows at a time to a Flyte task that starts a container for each batch. Their concern is that dynamic workflows and map_task appear to require the full input to be materialized in memory first, which defeats the purpose of batching. They would prefer not to rewrite their Python code in PySpark and are looking for a way to do incremental batch processing in Flyte without restructuring their existing generator logic.
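The thread is still open, but one common pattern for this situation is to map over lightweight batch references (indices or file paths) rather than over the rows themselves, so the orchestrating workflow never holds the data in memory. Below is a minimal, hypothetical sketch of that idea; the task names, the default batch count, and the one-million-row slice convention are assumptions for illustration, not something stated in the thread.

```python
import typing
from flytekit import task, workflow, map_task


@task
def make_batch_indices(num_batches: int) -> typing.List[int]:
    # Materialize only lightweight batch indices, never the rows themselves.
    return list(range(num_batches))


@task
def process_batch(batch_index: int) -> int:
    # Hypothetical per-batch work: each mapped task would read only its own
    # slice, e.g. rows [batch_index * 1_000_000, (batch_index + 1) * 1_000_000),
    # directly from shared storage, so the full 150 GB never sits in one container.
    rows_processed = 1_000_000  # placeholder for the real processing logic
    return rows_processed


@workflow
def parallel_wf(num_batches: int = 150) -> typing.List[int]:
    indices = make_batch_indices(num_batches=num_batches)
    # map_task fans out one container per index; concurrency caps how many
    # batches run at once.
    return map_task(process_batch, concurrency=10)(batch_index=indices)
```

Note that this sketch sidesteps the in-memory concern only if each batch is addressable by offset or file path in shared storage; if the data exists solely behind the generator, the generator would still need to write each million-row chunk out (for example as files) before the mapped tasks can pick them up.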

Status
open

Source
#ask-the-community