Summary
The user experienced a system error after upgrading `flytekit` to version 1.13.7 while keeping the backend `flyte` at version 1.13.1. The error was linked to exceeding the maximum number of retry attempts due to an invalid TaskSpecification, leading the user to ask whether the version mismatch was the cause and to seek clarification on compatibility. They noted the complexity of their code, which includes a custom task resolver and container image, and planned to check for version mismatches in the container configuration. The user had access to kubectl to describe the pod and identified the issue as related to `map_task`. They intended to rerun the workflow after cleaning up previous pods and mentioned adjustments made to customize `map_task`. Additionally, they planned to document more details in a GitHub issue, pinpointing the problem to `primary_container_name` and its configuration. Ultimately, the issue was resolved by switching from the legacy `flytekitplugins-pod` plugin to `PodTemplate`, which worked successfully.
xinzhou
It turns out that the code was using the legacy `flytekitplugins-pod` plugin. Once I switched to `PodTemplate` (as advised <https://github.com/flyteorg/flytekit/tree/master/plugins/flytekit-k8s-pod|here>), it worked!
xinzhou
Ah, the issue is indeed `primary_container_name`. If the `task_config` of the task is not overwritten, `primary_container_name` will be the pod name. But if it's overwritten with `flytekitplugins.pod.task.Pod`, then `primary_container_name` will be `primary`, which causes the error.
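The failure mode described above can be modeled with a small sketch. This is a simplified stand-in for flytepropeller's sidecar plugin check, not the real implementation (the function and exception names are hypothetical): the plugin looks for a container whose name matches the task's `primary_container_name` annotation and rejects the spec when none matches.

```python
class BadTaskSpecification(Exception):
    """Simplified stand-in for the backend's BadTaskSpecification error."""

def check_primary_container(container_names: list[str], primary_container_name: str) -> None:
    # The sidecar plugin requires that the pod spec contain a container
    # whose name matches the task's primary_container_name annotation.
    if primary_container_name not in container_names:
        raise BadTaskSpecification(
            f"invalid TaskSpecification, primary container "
            f"[{primary_container_name}] not defined"
        )

# Without the Pod task_config the only container carries the pod/task
# name, so an annotation of "primary" no longer matches anything.
check_primary_container(["my-task-pod"], "my-task-pod")  # passes
try:
    check_primary_container(["my-task-pod"], "primary")
except BadTaskSpecification as e:
    print(e)  # invalid TaskSpecification, primary container [primary] not defined
```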
curupa
For sure. Once we have more details, let's try to capture that in a gh issue.
xinzhou
Thanks for the tips, guys! I will report back
josh210
ah then yeah I would agree with you it's some version mismatch issue. But maybe the map task needs to be given the primary container name explicitly?
xinzhou
<@U07655DJTDM> I got:
```
primary_container_name: primary
```
xinzhou
Let me check the code. We did a few tweaks to tailor `map_task` to our use cases, so very likely.
curupa
This was supposed to be a no-op, but something broke in your case.
curupa
<@U04UNGML8NB>, in <https://github.com/flyteorg/flytekit/commit/4767fd865eefbb576e501247f1bfdbcbc1462a51|flytekit 1.12.0> we switched the implementation of `map_task` to use array nodes. Just to unblock you, you can still import the legacy map task from <https://github.com/flyteorg/flytekit/blob/master/flytekit/core/legacy_map_task.py|here>, but I'd love to understand what broke in your case.
xinzhou
I’ll rerun the workflow. The previous pods have been cleaned up.
josh210
can you put the describe pod output here? or the output of
```
kubectl -n datology-development describe pods <failing pod name> | grep primary_container_name
```
xinzhou
It seems to be related to `map_task`.
xinzhou
yeah, I do have access to `kubectl` to describe the pod
josh210
do you have access to `kubectl` for your cluster?
josh210
you should describe the pod and make sure it has a container named `primary`
xinzhou
Thanks for the pointer! The code is fairly complex with custom task resolver and custom container image, but I will check the container config to see if there is any mismatch between the flytekit versions.
josh210
can you send the workflow and tasks you're using? It looks like you might have a custom pod where the primary container is misnamed
xinzhou
When I upgraded `flytekit` to 1.13.7 while the backend `flyte` version was 1.13.1, I got the following system error running a remote workflow:
```
RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: worker error(s) encountered: [0]: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [sidecar]: [BadTaskSpecification] invalid TaskSpecification, primary container [primary] not defined
```
Could it be caused by the `flytekit` version being ahead of the `flyte` version? How does version compatibility work?