F

Flyte enables you to build & deploy data & ML pipelines, hassle-free. The infinitely scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks. Explore and Join the Flyte Community!

Workflow Modification Recommendations

Summary

The user seeks recommendations for modifying a workflow during the registration phase using a mutating webhook to add a task without user management, emphasizing the need for these changes to occur before registration due to compiler checks. They mention Flytekit's new on_failure argument and question the necessity of a finally option. Their use case involves managing Kubernetes resources through a custom task that should integrate with other tasks in a workflow, ensuring resources are terminated after training tasks, as Flyte currently lacks a mechanism for this. The user is considering creating another sensor agent for orchestrated garbage collection and highlights their choice of the Flyte agent route to leverage the framework without changes across the stack for easier testing and production operation. They note that Kevin has assisted with some initial agent use cases and problems, and express concern about the hackiness of invoking cleanup from a get method, questioning if they can refer to Kubernetes resources created by a custom agent task in another workflow.

Status
resolved
Tags
    Source
    #ask-the-community
      s

      shuliang

      9/19/2024

      btw, thanks for the brainstorming! I worked it out with custom sensor (knowing how to talk to k8s and do selective deletion) at the moment. Injecting the sensor node will be later for UX purpose. : )

      s

      shuliang

      9/19/2024

      > what you’re describing sounds like a sidecar task. that is not part of the flyte crd api. a sidecar task is still a flyte task. it will be part of the flyte API/ecosystem, no?

      > sidecar containers, init containers, within a pod template. actually, this might not be a bad idea if we need to go down to lower level : )

      > a gh issue describing the scenario at a high level Will do while I am trying out approaches…

      y

      ytong

      9/19/2024

      it’s maybe worth it to consider if the lifecycle management can be done via sidecar containers, init containers, within a pod template. but that is a very different lower level approach. almost might be worth it to make a gh issue describing the scenario at a high level. if there’s a lot of interest, maybe there is a pattern that is missing from the flyte/flytekit api.

      y

      ytong

      9/19/2024

      no… what you’re describing sounds like a sidecar task. that is not part of the flyte crd api.

      s

      shuliang

      9/19/2024

      no i mean is there API or configuration to do so?

      y

      ytong

      9/19/2024

      manually mark as succeeded? no… any running node will indicate to propeller that the wf is still running.

      s

      shuliang

      9/19/2024

      can we selected to tell the workflow as succeeded or completed if one of the task that we can compelted/succeeded?

      s

      shuliang

      9/19/2024

      if one task in the workflow is in Running state, the workflow will not be marked as completed, right?

      y

      ytong

      9/19/2024

      i see

      s

      shuliang

      9/19/2024

      task 1 & 2 is not sibling. but they have relationship internally. task 2 will cehck the reachability and query the graph/data from k8s resources created in task 1

      y

      ytong

      9/19/2024

      oh task 1 & 2 are sibling tasks, that you want to share resource?

      s

      shuliang

      9/19/2024

      my other option I mentioned is a sensor cleanup agent… but it may impact the user workflow completion

      s

      shuliang

      9/19/2024

      but I think your idea to check in the get call might work. basically I marked the task 1 never succeeding, and so the get call will always be invoked. in the get call, I will check the Task2 training job type execution status, if it is completed, I can just deleted the k8s resources.

      s

      shuliang

      9/19/2024

      I am open to do any hacks atm lol

      s

      shuliang

      9/19/2024

      yeah exactly. the distributed mult-node training for graph neural network is complicated everything tbd

      y

      ytong

      9/19/2024

      but not a small undertaking, i can promise

      y

      ytong

      9/19/2024

      the only way to access it is if you make it truly a shared resource (like how http://google.com|google.com is a shared resource). or rebuild the lifecycle management stuff we’ve built internally for the union Actor support.

      s

      shuliang

      9/19/2024

      or make it even hackier:

      I can mark the Task1 always running. :sweat_smile:

      y

      ytong

      9/19/2024

      it can get cleaned up at any time.

      y

      ytong

      9/19/2024

      cannot/should not right? that other workflow has it’s own lifecycle.

      s

      shuliang

      9/19/2024

      > and cleanup can still be invoked from get…
      good thought haha. though get is only invoked before task is succeeded

      s

      shuliang

      9/19/2024

      yea I guess so…

      y

      ytong

      9/19/2024

      ummm no, nodes (of any kind) are unique to a workflow.

      s

      shuliang

      9/19/2024

      if it does not support it, which can be fine, then we are wondering if we can referring to the k8s resources that were created by a custom agent task in another workflow?

      y

      ytong

      9/19/2024

      and cleanup can still be invoked from get… but that is very hacky. get should not be mutating

      s

      shuliang

      9/19/2024

      for example, in addition to this, maybe I can put another thread does the flyte support sharing task node? say multiple workflows can refer to the same task node?

      s

      shuliang

      9/19/2024

      yeah though I am a k8s developer, was preferring the custom k8s operator and backend plugins only need to create custom resources. but the changes were evaluated to a bit hairy than custom agent. but then we will need to sort out other problems :sweat_smile:

      y

      ytong

      9/19/2024

      oh i agree. backend plugins are substantially more complicated.

      s

      shuliang

      9/19/2024

      Kevin recently has helped some of my problems as well.

      s

      shuliang

      9/19/2024

      Kevin has helped some of our initial agent use case as well. The reasons we chose flyte agent route is primarily leverage the flyte agent framework without changes across the stack related flyte proto, backend plugin, SDK and easy to test iterate and operate in production

      s

      shuliang

      9/19/2024
      y

      ytong

      9/19/2024

      let me follow-up a bit internally

      s

      shuliang

      9/19/2024

      the idempotency is handled by the K8s API itself. > also i wasn’t aware we had piped the k8s api through to agent tasks. oh yeah, we (from LinkedIn) currently is doing it. :wink:

      y

      ytong

      9/19/2024

      also i wasn’t aware we had piped the k8s api through to agent tasks.

      y

      ytong

      9/19/2024

      so an agent task configurable by the user, to run delete as part of success. not sure if this is in-scope for the agent interface… and delete will need to be idempotent ofc.

      s

      shuliang

      9/19/2024

      selectively, means: if users specify, they don’t want to delete the resources created in task 1, we will keep it no need to delete it. otherwise, we will cleanup the k8s resources from taks 1

      y

      ytong

      9/19/2024

      when?

      s

      shuliang

      9/19/2024

      basically like this:

      — task 1 (custom agent task that manages the k8s resources) | -- Task 2 (MPIJob)

      We want to selectively delete the k8s resources from task 1

      y

      ytong

      9/19/2024

      the person who wrote most of that is out atm, but i’ll check with him to see if there was ever talk about adding a cleanup phase. is it possible to add the cleanup just as the last step before get returns success?

      y

      ytong

      9/19/2024

      you’re saying delete gets called on Abort but not on success?

      s

      shuliang

      9/19/2024

      yes

      y

      ytong

      9/19/2024

      this is an AgentTask? like the <https://docs.flyte.org/en/latest/flyte_agents/index.html#flyte-agents-guide|Flyte Agent>?

      s

      shuliang

      9/19/2024

      ahh in our case, we did not choose the operator route.

      basically, we have a custom agent to create, get, delete the k8s resource based on the custom task configuration.

      the k8s resource can be deleted nicely if the custom task is running state and gets aborted from the UI But once the task succeeded, the k8s resources are hanging there, (forevre)

      y

      ytong

      9/19/2024

      usually when a custom task creates a custom k8s resource (like spark), it’s up to the operator cleanup, and i think propeller deletes the CRD (after some time)

      s

      shuliang

      9/19/2024

      Thanks for asking the question. after the user defined workflow is done.

      y

      ytong

      9/19/2024

      could you elaborate a bit? when do you want the cleanup to happen, asap? or after the whole workflow is done?

      y

      ytong

      9/19/2024

      yeah that one, just as an fyi, not relevant here.

      s

      shuliang

      9/19/2024

      > flytekit recently added an on_failure argument… you mean this PR: https://github.com/flyteorg/flytekit/pull/840?

      s

      shuliang

      9/19/2024

      right now I was exploring creating another sensor agent to do the orchestrated GC work :sweat_smile:

      s

      shuliang

      9/19/2024

      the use case is that we are managing the k8s resources from a custom task by agent, users can use that custom task with other types of task in a workflow. once that custom task is succeeded w.r.t. k8s resources are creates successfully and running and succeeded, there is no where in the flyte is killing the k8s resources. so we want to kill the k8s resources after the training tasks is done

      s

      shuliang

      9/19/2024

      > flytekit recently added an on_failure argument… are you looking for a finally? Sort of.

      s

      shuliang

      9/19/2024

      > before registration is the place to do it, yeah, we can “force” users to specify that in the DSL. but what we are looking for is to inject that task/node without users explictly write it in DSL.

      y

      ytong

      9/19/2024

      flytekit recently added an on_failure argument… are you looking for a finally?

      y

      ytong

      9/19/2024

      what’s the use-case out of curiosity?

      y

      ytong

      9/19/2024

      if you want to modify, before registration is the place to do it, since the compiler check comes after that.

      s

      shuliang

      9/18/2024

      :thread:: is it recommended to modify the workflow during the registration phase before compilation/persistent to the storage? or perhaps some mutating webhook to modify the workflow? The use case is that we want to modify the users specified workflow with appending an additional task that users don’t need to care about in the users provided workflow.