Summary
The user seeks recommendations for modifying a workflow during the registration phase using a mutating webhook to add a task without user management, emphasizing the need for these changes to occur before registration due to compiler checks. They mention Flytekit's new on_failure
argument and question the necessity of a finally
option. Their use case involves managing Kubernetes resources through a custom task that should integrate with other tasks in a workflow, ensuring resources are terminated after training tasks, as Flyte currently lacks a mechanism for this. The user is considering creating another sensor agent for orchestrated garbage collection and highlights their choice of the Flyte agent route to leverage the framework without changes across the stack for easier testing and production operation. They note that Kevin has assisted with some initial agent use cases and problems, and express concern about the hackiness of invoking cleanup from a get
method, questioning if they can refer to Kubernetes resources created by a custom agent task in another workflow.
shuliang
btw, thanks for the brainstorming! I worked it out with custom sensor (knowing how to talk to k8s and do selective deletion) at the moment. Injecting the sensor node will be later for UX purpose. : )
shuliang
> what you’re describing sounds like a sidecar task. that is not part of the flyte crd api. a sidecar task is still a flyte task. it will be part of the flyte API/ecosystem, no?
> sidecar containers, init containers, within a pod template. actually, this might not be a bad idea if we need to go down to lower level : )
> a gh issue describing the scenario at a high level Will do while I am trying out approaches…
ytong
it’s maybe worth it to consider if the lifecycle management can be done via sidecar containers, init containers, within a pod template. but that is a very different lower level approach. almost might be worth it to make a gh issue describing the scenario at a high level. if there’s a lot of interest, maybe there is a pattern that is missing from the flyte/flytekit api.
ytong
no… what you’re describing sounds like a sidecar task. that is not part of the flyte crd api.
shuliang
no i mean is there API or configuration to do so?
ytong
manually mark as succeeded? no… any running node will indicate to propeller that the wf is still running.
shuliang
can we selected to tell the workflow as succeeded
or completed if one of the task that we can compelted/succeeded?
shuliang
if one task in the workflow is in Running state, the workflow will not be marked as completed, right?
ytong
i see
shuliang
task 1 & 2 is not sibling. but they have relationship internally. task 2 will cehck the reachability and query the graph/data from k8s resources created in task 1
ytong
oh task 1 & 2 are sibling tasks, that you want to share resource?
shuliang
my other option I mentioned is a sensor cleanup agent… but it may impact the user workflow completion
shuliang
but I think your idea to check in the get
call might work.
basically I marked the task 1 never succeeding, and so the get
call will always be invoked.
in the get
call, I will check the Task2 training job type execution status, if it is completed, I can just deleted the k8s resources.
shuliang
I am open to do any hacks atm lol
shuliang
yeah exactly. the distributed mult-node training for graph neural network is complicated everything tbd
ytong
but not a small undertaking, i can promise
ytong
the only way to access it is if you make it truly a shared resource (like how http://google.com|google.com is a shared resource). or rebuild the lifecycle management stuff we’ve built internally for the union Actor support.
shuliang
or make it even hackier:
I can mark the Task1 always running. :sweat_smile:
ytong
it can get cleaned up at any time.
ytong
cannot/should not right? that other workflow has it’s own lifecycle.
shuliang
> and cleanup can still be invoked from get…
good thought haha.
though get
is only invoked before task is succeeded
shuliang
yea I guess so…
ytong
ummm no, nodes (of any kind) are unique to a workflow.
shuliang
if it does not support it, which can be fine, then we are wondering if we can referring to the k8s resources that were created by a custom agent task in another workflow?
ytong
and cleanup can still be invoked from get… but that is very hacky. get should not be mutating
shuliang
for example, in addition to this, maybe I can put another thread does the flyte support sharing task node? say multiple workflows can refer to the same task node?
shuliang
yeah though I am a k8s developer, was preferring the custom k8s operator and backend plugins only need to create custom resources. but the changes were evaluated to a bit hairy than custom agent. but then we will need to sort out other problems :sweat_smile:
ytong
oh i agree. backend plugins are substantially more complicated.
shuliang
Kevin recently has helped some of my problems as well.
shuliang
Kevin has helped some of our initial agent use case as well. The reasons we chose flyte agent route is primarily leverage the flyte agent framework without changes across the stack related flyte proto, backend plugin, SDK and easy to test iterate and operate in production
shuliang
actually we have asked the question https://flyte-org.slack.com/archives/C06SYN9QJ5N/p1721885119188849
ytong
let me follow-up a bit internally
shuliang
the idempotency is handled by the K8s API itself. > also i wasn’t aware we had piped the k8s api through to agent tasks. oh yeah, we (from LinkedIn) currently is doing it. :wink:
ytong
also i wasn’t aware we had piped the k8s api through to agent tasks.
ytong
so an agent task configurable by the user, to run delete
as part of success. not sure if this is in-scope for the agent interface… and delete will need to be idempotent ofc.
shuliang
selectively, means: if users specify, they don’t want to delete the resources created in task 1, we will keep it no need to delete it. otherwise, we will cleanup the k8s resources from taks 1
ytong
when?
shuliang
basically like this:
— task 1 (custom agent task that manages the k8s resources) | -- Task 2 (MPIJob)
We want to selectively delete the k8s resources from task 1
ytong
the person who wrote most of that is out atm, but i’ll check with him to see if there was ever talk about adding a cleanup phase. is it possible to add the cleanup just as the last step before get returns success?
ytong
you’re saying delete gets called on Abort but not on success?
shuliang
yes
ytong
this is an AgentTask? like the <https://docs.flyte.org/en/latest/flyte_agents/index.html#flyte-agents-guide|Flyte Agent>?
shuliang
ahh in our case, we did not choose the operator route.
basically, we have a custom agent to create, get, delete the k8s resource based on the custom task configuration.
the k8s resource can be deleted nicely if the custom task is running state and gets aborted from the UI But once the task succeeded, the k8s resources are hanging there, (forevre)
ytong
usually when a custom task creates a custom k8s resource (like spark), it’s up to the operator cleanup, and i think propeller deletes the CRD (after some time)
shuliang
Thanks for asking the question. after the user defined workflow is done.
ytong
could you elaborate a bit? when do you want the cleanup to happen, asap? or after the whole workflow is done?
ytong
yeah that one, just as an fyi, not relevant here.
shuliang
> flytekit recently added an on_failure
argument…
you mean this PR: https://github.com/flyteorg/flytekit/pull/840?
shuliang
right now I was exploring creating another sensor agent to do the orchestrated GC work :sweat_smile:
shuliang
the use case is that we are managing the k8s resources from a custom task by agent, users can use that custom task with other types of task in a workflow. once that custom task is succeeded w.r.t. k8s resources are creates successfully and running and succeeded, there is no where in the flyte is killing the k8s resources. so we want to kill the k8s resources after the training tasks is done
shuliang
> flytekit recently added an on_failure
argument… are you looking for a finally
?
Sort of.
shuliang
> before registration is the place to do it, yeah, we can “force” users to specify that in the DSL. but what we are looking for is to inject that task/node without users explictly write it in DSL.
ytong
flytekit recently added an on_failure
argument… are you looking for a finally
?
ytong
what’s the use-case out of curiosity?
ytong
if you want to modify, before registration is the place to do it, since the compiler check comes after that.
shuliang
:thread:: is it recommended to modify the workflow during the registration phase before compilation/persistent to the storage? or perhaps some mutating webhook to modify the workflow? The use case is that we want to modify the users specified workflow with appending an additional task that users don’t need to care about in the users provided workflow.