
Flytekit Upgrade Error Resolution

Summary

The user hit a system error after upgrading flytekit to 1.13.7 while the backend flyte stayed at 1.13.1: executions exhausted the maximum number of system retries due to an invalid TaskSpecification ("primary container [primary] not defined"). This raised two questions: was the version mismatch the cause, and how does flytekit/flyte version compatibility work? The user's code is fairly complex, with a custom task resolver and a custom container image, so they first checked the container configuration for flytekit version mismatches. With kubectl access to describe the failing pods, they traced the issue to map_task, which they had tweaked for their use cases, and planned to capture the details in a GitHub issue. The root cause was primary_container_name: if a task's task_config is not overwritten, primary_container_name is the pod name, but when it is overwritten with flytekitplugins.pod.task.Pod, primary_container_name becomes "primary", which triggered the error. Switching from the legacy flytekitplugins-pod plugin to PodTemplate resolved the issue.

Status: resolved
Source: #ask-the-community

      xinzhou · 10/4/2024

      It turns out that the code was using legacy flytekitplugins-pod. Once I switched to PodTemplate (as advised here: https://github.com/flyteorg/flytekit/tree/master/plugins/flytekit-k8s-pod), it worked!
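      A minimal sketch of the switch described above, assuming flytekit's pod_template support and the kubernetes Python client; the task body and resource details are illustrative, not the user's actual code:

```python
from flytekit import PodTemplate, task
from kubernetes.client import V1Container, V1PodSpec

# PodTemplate replaces the legacy flytekitplugins.pod.Pod task_config.
# primary_container_name must match a container in pod_spec so the
# backend's sidecar plugin can locate the main task container.
template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            # flytekit merges the task's image/command into this container.
            V1Container(name="primary"),
        ],
    ),
)

@task(pod_template=template)
def my_task(x: int) -> int:  # hypothetical task
    return x * 2
```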

      xinzhou · 10/4/2024

      Ah, the issue is indeed primary_container_name. If the task_config of the task is not overwritten, primary_container_name will be the pod name. But if it’s overwritten with flytekitplugins.pod.task.Pod, then primary_container_name will be primary, which causes the error.
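      For context, a hedged illustration of the contrast described above, using the legacy plugin's API; the task bodies are placeholders, and the default of primary_container_name in the legacy Pod config should be verified against your installed plugin version:

```python
from flytekit import task
from flytekitplugins.pod import Pod  # legacy plugin
from kubernetes.client import V1Container, V1PodSpec

# No task_config: flytekit emits a plain container task, and the pod's
# single container is named after the task/pod itself.
@task
def plain_task() -> None:
    ...

# Legacy Pod task_config: primary_container_name defaults to "primary"
# (per the message above), so the backend expects a container with
# exactly that name in the pod spec.
@task(
    task_config=Pod(
        pod_spec=V1PodSpec(containers=[V1Container(name="primary")]),
    )
)
def legacy_pod_task() -> None:
    ...
```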

      curupa · 10/4/2024

      For sure. Once we have more details, let's try to capture that in a gh issue.

      xinzhou · 10/4/2024

      Thanks for the tips, guys! I will report back

      josh210 · 10/4/2024

      ah then yeah I would agree with you it's some version mismatch issue. But maybe the map task needs to be given the primary container name explicitly?

      xinzhou · 10/4/2024

      <@U07655DJTDM> I got:

```
primary_container_name: primary
```
      
      xinzhou · 10/4/2024

      Let me check the code. We did a few tweaks to tailor map_task to our use cases, so very likely.

      curupa · 10/4/2024

      This was supposed to be a no-op, but something broke in your case.

      curupa · 10/4/2024

      <@U04UNGML8NB>, in flytekit 1.12.0 (https://github.com/flyteorg/flytekit/commit/4767fd865eefbb576e501247f1bfdbcbc1462a51) we switched the implementation of map_task to use array nodes. Just to unblock you, you can still import the legacy map task from https://github.com/flyteorg/flytekit/blob/master/flytekit/core/legacy_map_task.py, but I'd love to understand what broke in your case.
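      A hedged sketch of that escape hatch, assuming flytekit/core/legacy_map_task.py still exposes a map_task factory (verify against your installed flytekit); the square task is illustrative:

```python
from typing import List

from flytekit import task, workflow

# Legacy implementation, importable directly as an escape hatch; the
# default flytekit.map_task has been array-node-backed since 1.12.0.
from flytekit.core.legacy_map_task import map_task as legacy_map_task

@task
def square(x: int) -> int:
    return x * x

@workflow
def wf(xs: List[int]) -> List[int]:
    # Same call shape as the regular map_task(square)(x=xs).
    return legacy_map_task(square)(x=xs)
```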

      xinzhou · 10/4/2024

      I’ll rerun the workflow. The previous pods have been cleaned up.

      josh210 · 10/4/2024

      can you put the describe pod output here? or the output of `kubectl -n datology-development describe pods <failing pod name> | grep primary_container_name`

      xinzhou · 10/4/2024

      It seems to be related to map_task

      xinzhou · 10/4/2024

      yeah, I do have access to kubectl to describe the pod

      josh210 · 10/4/2024

      do you have access to kubectl for your cluster?

      josh210 · 10/4/2024

      you should describe the pod and make sure it has a container named primary

      xinzhou · 10/4/2024

      Thanks for the pointer! The code is fairly complex, with a custom task resolver and a custom container image, but I will check the container config to see if there is any mismatch between flytekit versions.

      josh210 · 10/4/2024

      can you send the workflow and tasks you're using? It looks like you might have a custom pod where the primary container is misnamed

      xinzhou · 10/4/2024

      When I upgraded flytekit to 1.13.7 while the backend flyte version was 1.13.1, I got the following system error running a remote workflow:

```
RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: worker error(s) encountered: [0]: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [sidecar]: [BadTaskSpecification] invalid TaskSpecification, primary container [primary] not defined
```

      Could it be caused by the flytekit version being ahead of the flyte version? How does version compatibility work?