Summary
The user is evaluating Flyte as a cloud-based ML production platform and is learning Terraform for deployment. They are inexperienced with non-managed clusters, and 'terraform apply' has failed repeatedly, asking them to enable several GCP APIs (Cloud Resource Manager, Cloud SQL Admin, Service Usage), even though they created a fresh GCP project for Flyte. They are seeking suggestions for improving their code and documentation. They plan to benchmark Flyte against systems like Slurm, Ray, and Kubeflow to assess its cloud feasibility. The user asks how many maintenance hours to expect for a Flyte cluster on GCP. Other Flyte users reply that it is close to maintenance-free for everyday cases, but that at their scale it took some time to get there and a couple of upgrades brought surprises. Once set up, the platform runs smoothly, with occasional work needed for upgrades or user-requested features; maintaining it yourself requires 1 or 2 people skilled in K8s, cloud providers, and infrastructure as code. The user is waiting for a fix from a colleague before trying the GCP deployment again.
fabio.gratz
> We use Slurm, Ray, Kubeflow, and their deployment in cloud is easy.
Can’t speak for Slurm, but Ray and Kubeflow can be installed in an existing K8s cluster with a single kubectl apply or helm install because all the resources they require are internal to the cluster. Flyte is a bit more complicated than that because it’s a more elaborate but also more potent system that uses resources outside the K8s cluster itself, such as blob storage, a managed database, and cloud provider IAM permissions, and you’ll need to configure a load balancer and authentication. The terraform module helps with that though.
fabio.gratz
But if you would like to maintain it yourself you’ll need 1 or 2 people who are good with K8s, cloud providers, infra as code etc.
fabio.gratz
Once we had the platform set up, it was smooth sailing. Occasional work on upgrades or when platform users would like a feature. Also happy to help on GCP <@U07MS09EZ47> :slightly_smiling_face:
roman.kazinnik
Thank you for getting back. Right now I am waiting for a fix for deploying Flyte on GCP from <@U04H6UUE78B>. I will give it another try once I hear back from him.
rafaelraposo
Let me know if you have any questions <@U07MS09EZ47>. Happy to help :slightly_smiling_face:
rafaelraposo
It's pretty much maintenance-free for your everyday case, but due to our scale it did take us some time to get there. We also have some special cases when it comes to the platform.
Make sure you size things correctly (like the database), but there's no one-size-fits-all.
We did have some surprises in a couple of upgrades, but other than that it runs just fine.
kumare
It’s open source, and there are many folks that run Flyte. Flyte, in our opinion, is very resilient, but use cases, scale, and integrations matter.
Cc <@U03CLARPEJ0> (Spotify), <@U04664Z7H37> (recogni), <@U05R4A6N2DN> (Mercedes) may have better answers
roman.kazinnik
I would appreciate your advice. How many hours should we expect to spend maintaining a Flyte cluster installed on GCP? Perhaps you have statistics on how many hours your clients spend maintaining Flyte clusters in the cloud?
kumare
Got it
roman.kazinnik
Our plan was to evaluate Flyte and, if it works, compare it to Union. Eventually the goal is to see if Union's extra features are worth it.
roman.kazinnik
I need to try it and eventually recommend whether our company can use Flyte in the cloud. We use Slurm, Ray, Kubeflow, and their deployment in the cloud is easy.
kumare
What does that mean
roman.kazinnik
I can deploy anything for my benchmarking tests of Flyte.
kumare
Hi Roman, would it potentially be better to deploy Union if you have little experience with Terraform? This way you could do a test pretty swiftly.
roman.kazinnik
<@U04H6UUE78B>
'terraform apply' failed several times, asking me to install the Cloud Resource Manager API, Cloud SQL Admin API, and Service Usage API. Now it is failing with the following (screenshot attached):
What is the problem? I created a fresh new GCP project to try Flyte.
roman.kazinnik
:+1:
roman.kazinnik
Hi <@U04H6UUE78B> - I am going to create Flyte in our cloud to MVP Flyte as our new ML production platform.