

Flyte Binary Deployment Scale Issues

Summary

The user is experiencing scale issues with the Flyte Binary deployment on Oracle Cloud Infrastructure. They are considering switching to Flyte Core but are currently facing several problems:

1. The webhook service fails under load, and the current limit of 10 retries is insufficient.
2. Flyte containers occasionally restart due to health check failures. At the last failure the container was using 25GB of memory against a 32GB limit, and the node still had memory available.
3. Memory consumption increases steadily, possibly indicating a memory leak.
4. The user asks whether the number of replicas for the Flyte Binary can be increased; they read that it may not work and seek clarification.
5. They ask whether switching to Flyte Core would allow more replicas, to improve availability during redeployment.

The user also expresses confusion regarding the relevance of leader election in relation to database usage.

Status
resolved
Tags
  • Oracle Cloud Infrastructure
  • Scale
  • Flyte
  • Deployment Issues
  • Product Help
  • Question
  • Performance
  • Bug Report
Source
#flyte-deployment

    kumare

    10/31/2024

    It is possible, with leader election enabled, but we will have to test it and ensure it does indeed work
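For readers unsure why leader election matters when running several replicas, here is a toy illustration of the general idea: replicas compete for a time-limited lease, and only the holder does the active work while the others stand by. This is a minimal in-memory sketch, not Flyte's or Kubernetes' implementation; `Lease`, `try_acquire`, and the replica names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """A toy in-memory lease; a real system keeps this in a shared store."""
    holder: Optional[str] = None
    expires_at: float = 0.0


def try_acquire(lease: Lease, candidate: str, now: float, ttl: float) -> bool:
    """Return True if `candidate` is the leader after this call.

    The candidate wins if the lease is free, has expired, or is already
    its own (in which case the lease is renewed for another `ttl`).
    """
    if lease.holder is None or now >= lease.expires_at or lease.holder == candidate:
        lease.holder = candidate
        lease.expires_at = now + ttl
        return True
    return False


lease = Lease()
print(try_acquire(lease, "replica-a", now=0.0, ttl=15.0))   # replica-a takes the free lease
print(try_acquire(lease, "replica-b", now=5.0, ttl=15.0))   # lease still held, replica-b stands by
print(try_acquire(lease, "replica-b", now=20.0, ttl=15.0))  # lease expired, replica-b takes over
```

The standby replica takes over only after the lease expires, which is why a second replica can shorten downtime during a redeployment without two instances acting at once.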

    david.espejo

    10/31/2024

    Hey <@U062Y21KSQG>, what Flyte version are you running? There's a potential memory leak in flyte-binary that was fixed in 1.13.2.
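To check whether a running version already includes that fix, the version strings need a numeric comparison rather than a string comparison (as text, "1.13.10" sorts before "1.13.2"). A minimal sketch, assuming you have the version or image tag as a string; `parse_version` and `includes_fix` are hypothetical helpers, and pre-release suffixes are ignored.

```python
import re


def parse_version(text: str) -> tuple:
    """Extract (major, minor, patch) from strings like 'v1.13.2'.

    Anything after the patch number (for example '-rc1') is ignored.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no semantic version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def includes_fix(running: str, fixed_in: str = "1.13.2") -> bool:
    """True if `running` is at or above the version that carries the fix."""
    return parse_version(running) >= parse_version(fixed_in)


for version in ("v1.13.1", "v1.13.2", "v1.13.10"):
    print(version, includes_fix(version))
```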

    guyarad

    10/31/2024

    Scale issues with Flyte Binary: Hi all, we are deploying Flyte Binary in Oracle Cloud Infra. We should probably switch to the Flyte Core deployment, but that's what it is for now... We noticed a few things:

    1. The webhook service fails under load, and the 10 retries aren't enough.
    2. Flyte containers restart sometimes (the health check fails). I don't have a correlation between that and memory consumption, but the last time it failed the Flyte container was using 25GB. The K8s node it was on still had memory available, and the limit I gave was 32GB. How can I diagnose why the health check failed? What should I look for?
    3. Memory consumption: it seems like memory consumption steadily increases. Could there be a memory leak or something?
    4. Availability: is it possible to have more replicas for the Flyte Binary? I read somewhere that it won't work, but I can't seem to find the link. Can you share more details? Is it correct? If so, why? And what can be done?
    5. If the answer to #4 is only 1 replica, does using Flyte Core change things? Can we have more replicas to increase Flyte availability, especially during re-deployment? Thanks guys!
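One way to turn the "steadily increases" observation in item 3 into a number is to fit a straight line through periodic memory readings and extrapolate to the container limit. The sketch below assumes you can export (timestamp, bytes) samples from whatever metrics source you use; the hourly samples are made up for illustration, only the 25GB and 32GB figures come from the message above, and the function names are hypothetical.

```python
def memory_slope(samples):
    """Least-squares slope of (timestamp_seconds, bytes) pairs, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must not all be equal")
    return num / den


def hours_until_limit(samples, limit_bytes):
    """Extrapolate linearly from the last sample; None if memory is not growing."""
    slope = memory_slope(samples)
    if slope <= 0:
        return None
    _, last = samples[-1]
    return max(limit_bytes - last, 0) / slope / 3600


GIB = 1024 ** 3
# Hypothetical hourly readings climbing from 20 GiB to 25 GiB.
readings = [(hour * 3600, (20 + hour) * GIB) for hour in range(6)]
print(f"growth: {memory_slope(readings) * 3600 / GIB:.2f} GiB/hour")
print(f"hours until 32 GiB limit: {hours_until_limit(readings, 32 * GIB):.1f}")
```

A positive slope that persists across idle periods is a hint worth following up, not proof of a leak; a flat or negative slope returns `None` from `hours_until_limit`.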