Summary
The user is seeking a workflow orchestrator to build data pipelines due to issues with Prefect, such as long-running jobs losing connection to Azure containers and challenges in managing failed tasks. They are considering alternatives like Dagster, Restate, or Flyte but find it difficult to differentiate among them. Their requirements include backfilling around 100GB of data with simple transformations and managing demanding workflow-oriented jobs that involve processing tasks like file downloads and OCR, which need to run hundreds simultaneously. They faced reliability and traceability issues with Prefect, as they could not retry individual tasks or store task metadata, relying solely on logs for tracking. The user notes that Flyte is well-suited for long-running jobs, is Kubernetes-native, and offers features like recoverability, task-level retries, intra-task checkpointing, and caching, which could address their concerns.
habuelfutuh
Hey <@U07RRDNE62D> great to meet you!
I agree with all of the above. To me, the fundamental difference between Flyte and Temporal is the responsibilities each assumes, and what that implies about the user's responsibilities. Temporal is a workflow engine. It scales horizontally very well and helps maximize resource utilization of the engine. It offers durable executions and historical data about executions, and that's where its responsibilities end. This assumes the user is responsible for packaging their code as services, scaling those services, accounting for surge traffic... etc. This is why it works very well for microservices orchestration where, once in production, the load is predictable and the patterns can be monitored. On the other hand, it requires users to think of their code as a set of services from the get-go...
Flyte, on the other hand, encapsulates a powerful workflow engine but also assumes the responsibility of managing the entire infrastructure (thanks to it being k8s-native from day 1). This allows the user to write more intuitive code (function a calls functions b and c and passes data between them). The code can run and be debugged locally because it's just your native python code, functions/tasks can have different dependencies and docker images if needed... etc. and Flyte will take care of spinning up the needed Pods to execute the workloads it's asked to execute on demand and scale down to 0 when everything is done.
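To make the "function a calls functions b and c" shape concrete, here is a minimal sketch in plain Python with no Flyte imports, so it runs anywhere. The names a, b, and c come from the paragraph above; the bodies are hypothetical placeholders. In Flyte, b and c would typically be decorated with flytekit's @task and a with @workflow, which is the part that lets each step run in its own container.

```python
# Plain-Python sketch of "a calls b and c and passes data between them".
# The function bodies are hypothetical; only the call shape matters here.
# With flytekit, b and c would carry @task and a would carry @workflow.


def b(x: int) -> int:
    """First step: double the input."""
    return x * 2


def c(x: int) -> int:
    """Second step: add one to whatever the first step produced."""
    return x + 1


def a(x: int) -> int:
    """Compose the two steps, passing b's output into c."""
    doubled = b(x)
    return c(doubled)


if __name__ == "__main__":
    print(a(3))
```

Because these are ordinary functions, calling a(3) in a local interpreter or debugger works without any cluster, which is the local-debugging point made above.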
Disclaimer: I work for Union the company sponsoring and maintaining Flyte.
Happy to jump on a call to hear more about what you are trying to build and answer questions.
kumare
cc <@UNW4VP36V>
kumare
try it once - and a quick way might be to use union serverless - http://signup.union.ai
kumare
<@U07RRDNE62D> reading your use case - I do feel Union / Flyte is perfect for this. As compared to temporal - you do not have to manage infra, you get versioning, automatic cluster scaling up and down on demand, pythonic workflows, containerizations, resource isolation, resource targeting like - cpu vs GPU. Flyte is more of an infrastructure platform and temporal a workflow engine. Workflows are a part of the thing that flyte offers, but does quite a bit more
john657
I don't know quite as much about Temporal, but as I understand it, Temporal is great for big-scale microservice orchestration. But they don't deal in data. The maximum amount of data you can pass between steps of your workflow is quite small.
john657
Flyte is also built to be more scalable - whereas IIRC some other tools listed are doing the orchestration in a Python runtime, which can't scale beyond a certain point, Flyte is running in Go in the backend and is built to run hundreds of thousands of concurrent processes
charles.liu
I've heard comments that temporal is also really well suited for use cases that have more complex workflows and not just the data piece like prefect/dagster. How would you say Flyte compares with temporal with regard to that?
john657
<@UNZB4NW3S> may want to comment further :slightly_smiling_face:
john657
The main difference between Flyte and prefect/dagster/airflow/other pure orchestrators is that it integrates the infrastructure. This architecture has 2 main benefits:
• You don't need to manage 2 tools (i.e. prefect for orchestration, step functions/batch/k8s for compute)
• It's more efficient, since the orchestrator knows about all the compute that is available. So you'll end up with higher utilization since you can run multiple tasks from different workflows concurrently on the same compute nodes
john657
Flyte is pretty well-equipped to handle long-running jobs. It's k8s-native, so it runs the containers natively alongside the orchestrator in k8s. I haven't heard of folks experiencing a workflow that loses connections to pods before.
Flyte really shines on the recoverability aspect. It saves intermediate inputs and outputs in an object store automatically, so if a task fails at the end of a workflow, you can click "recover" (https://docs.flyte.org/en/latest/user_guide/concepts/main_concepts/flyte_console.html#recovering-executions) and it will hydrate the inputs to the failed task and continue on. There are also features around:
• task-level retries (https://docs.flyte.org/en/latest/user_guide/flyte_fundamentals/optimizing_tasks.html#retries)
• intra-task checkpointing (https://docs.flyte.org/en/latest/user_guide/advanced_composition/intratask_checkpoints.html#intratask-checkpoints)
• caching (https://docs.flyte.org/en/latest/user_guide/development_lifecycle/caching.html#caching)
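As a rough illustration of the recover and retry ideas above, here is a plain-Python sketch. It is not Flyte's implementation or API. The helpers run_with_retries and run_pipeline are hypothetical, a local directory stands in for the object store, and stored outputs are keyed only by step name, which is a simplification (a real cache would also key on inputs and version).

```python
import json
from pathlib import Path
from typing import Any, Callable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_with_retries(fn: Callable[[Any], Any], arg: Any, retries: int) -> Any:
    """Call fn(arg), retrying up to `retries` extra times on any exception."""
    last_error: Exception = RuntimeError("step was never attempted")
    for _ in range(retries + 1):
        try:
            return fn(arg)
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
    raise last_error


def run_pipeline(steps: List[Step], value: Any, store: Path, retries: int = 0) -> Any:
    """Run steps in order and persist each output as JSON under `store`.

    On a re-run, any step whose output file already exists is skipped and
    its stored output is loaded instead, so work resumes at the failed step.
    """
    for name, fn in steps:
        out_file = store / f"{name}.json"
        if out_file.exists():
            # Reuse the stored output instead of recomputing the step.
            value = json.loads(out_file.read_text())
            continue
        value = run_with_retries(fn, value, retries)
        out_file.write_text(json.dumps(value))
    return value
```

If a later step fails on the first run, calling run_pipeline again with the same store directory reuses the earlier outputs and only re-executes from the failed step; passing retries=1 instead lets a flaky step succeed within a single run.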
charles.liu
Also, very curious as to what use cases Flyte would be more appropriate than Dagster or vice versa.
charles.liu
Hey John! Thanks for the quick response. The biggest problem we encountered with Prefect was that it seemed to be really buggy with long-running jobs. It would lose connection to our Azure containers randomly, and when we cancelled jobs or jobs failed, a few tasks/flows would persist that we would have to clean up manually.
Other than that, we also had some issues with reliability & traceability. We weren't able to retry individual tasks or store metadata of the task directly, and relied solely on logs to keep track of that.
john657
Hey Charles, I'm a PM at Union and also focus a lot on Flyte. Could you describe what issues/limitations you had with Prefect? This might help answer whether Flyte could be more effective in those specific areas.
charles.liu
Hey, we're looking for a workflow orchestrator to build some of our data pipelines. We were originally on Prefect, but encountered several bugs/limitations, so we're looking to switch. We're currently debating between dagster, restate, and flyte, but there aren't many resources to help us differentiate between the three.
Some more context on our needs: