Summary
The user is seeking best practices for workflows that move large amounts of data, specifically failures caused by the task output size limit. They are currently passing a FlyteFile and considering a temporary fix of increasing the max-output-size-bytes parameter, but want a more robust way to handle large inputs and outputs. The maintainers note that JSON is passed inline, ask how many items are in the list, and mention ongoing work on auto-offloading support for large lists and a more compact representation of JSON. The user also asks about Pydantic data models and where to find which types are transferred inline versus offloaded.
habuelfutuh
This is probably the closest to what you are looking for: https://docs.flyte.org/en/latest/concepts/data_management.html
along
What is the "flyte way" to handle workflows with a lot of data? We have several workflows that handles 10s of GBs, and they're failing due to the output size -
is too large [28775519] bytes, max allowed [10485760] bytes
For now, we're passing a Flytefile instead of the actual data to overcome this issue, and I also understand another approach could be increasing the max-output-size-bytes
parameter, but this is only temporary, as data in the future could succeed this threshold. So - What is the proper way to handle large I/O?
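A minimal sketch of the FlyteFile workaround described above, for context; the task and workflow names are illustrative, not from the thread:
```python
import json
import os
import tempfile

from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task
def produce_records() -> FlyteFile:
    # Write the data to a local file; Flyte uploads it to blob storage and
    # passes only a reference (URI) between tasks, so the output size limit
    # never sees the payload itself.
    records = [{"id": i, "value": i * 2} for i in range(1_000_000)]
    path = os.path.join(tempfile.mkdtemp(), "records.json")
    with open(path, "w") as f:
        json.dump(records, f)
    return FlyteFile(path)


@task
def consume_records(f: FlyteFile) -> int:
    # FlyteFile is os.PathLike; opening it triggers the download.
    with open(f, "r") as fh:
        return len(json.load(fh))


@workflow
def wf() -> int:
    return consume_records(f=produce_records())
```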
along
What about Pydantic data models? Is there a place where I can see which types are transferred inline and which are offloaded?
habuelfutuh
If you are going to go with a dataframe, you can look at StructuredDatasets (https://docs.flyte.org/projects/cookbook/en/v0.3.66/auto/core/type_system/structured_dataset.html); they support strongly typed schemas that are compile-time validated. You can even combine them with Pandera (https://pandera.readthedocs.io/en/stable/) to define validation rules and have Flyte kick these off automatically (https://docs.flyte.org/en/latest/flytesnacks/examples/pandera_plugin/index.html).
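A minimal sketch of the StructuredDataset approach, assuming pandas and flytekit's kwtypes for the column schema; the column names and types here are illustrative:
```python
from typing import Annotated

import pandas as pd
from flytekit import kwtypes, task, workflow
from flytekit.types.structured import StructuredDataset

# Declare the expected columns; flytekit validates them when the
# dataset is materialized.
MySchema = Annotated[StructuredDataset, kwtypes(id=int, value=float)]


@task
def make_dataset() -> MySchema:
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    # The dataframe is offloaded to blob storage (Parquet by default);
    # only a reference crosses the task boundary.
    return StructuredDataset(dataframe=df)


@task
def use_dataset(sd: MySchema) -> float:
    # Download and materialize as a pandas dataframe.
    df = sd.open(pd.DataFrame).all()
    return float(df["value"].sum())


@workflow
def wf() -> float:
    return use_dataset(sd=make_dataset())
```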
along
I'll try using a dataframe, but I wish this could be supported with a data type that supports type hints; that could really help us (e.g., a list of dicts where I can define the schema)
kumare
Or csv
kumare
Or file
kumare
Or jsonl
kumare
Yes data frames
kumare
We have never seen a 10 GB JSON list
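A rough sketch of the JSONL suggestion above: write one dict per line to a local file and return a FlyteFile, so the payload is offloaded and can be streamed on both sides. generate_records is a hypothetical stand-in for the real data source:
```python
import json
import os
import tempfile

from flytekit import task
from flytekit.types.file import FlyteFile


def generate_records():
    # Stand-in for the real data source (hundreds of thousands of dicts).
    for i in range(100_000):
        yield {"id": i, "payload": "x" * 100}


@task
def dump_jsonl() -> FlyteFile:
    # One JSON object per line, written incrementally so the full list is
    # never held in memory; Flyte offloads the file and passes a reference.
    path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
    with open(path, "w") as f:
        for rec in generate_records():
            f.write(json.dumps(rec) + "\n")
    return FlyteFile(path)


@task
def count_records(f: FlyteFile) -> int:
    # Stream the file back line by line on the consuming side.
    with open(f, "r") as fh:
        return sum(1 for _ in fh)
```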
along
hundreds of thousands of JSONs... is there maybe a different data type that you support that will be offloaded?
kumare
We are also working on a more compact representation of JSON
kumare
Cc <@UNW4VP36V> <@UPBBNMXD1> <@UNR3C6Y4T>
kumare
We are indeed working on auto offloading support for large lists etc
kumare
How many items are in the list?
kumare
Aah yes JSON is passed inline
along
If so, what could be the reason for the max allowed size exception?
along
Lists of JSONs (dicts)
kumare
Only metadata is passed between tasks
kumare
Flyte will offload data
kumare
What is this data that you return that is inlined and 10 GB?
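To make kumare's point above concrete, a rough illustration of inline versus offloaded outputs, assuming flytekit defaults; both tasks are hypothetical:
```python
import os
import tempfile

from flytekit import task
from flytekit.types.file import FlyteFile


@task
def inline_output() -> list[dict]:
    # Plain lists/dicts are serialized into the output literal itself,
    # so their full size counts against max-output-size-bytes.
    return [{"id": i} for i in range(1000)]


@task
def offloaded_output() -> FlyteFile:
    # File-backed types are uploaded to blob storage; only the URI appears
    # in the output literal, regardless of the file's size.
    path = os.path.join(tempfile.mkdtemp(), "data.json")
    with open(path, "w") as f:
        f.write("[]")
    return FlyteFile(path)
```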