Flyte enables you to build & deploy data & ML pipelines, hassle-free. The infinitely scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks. Explore and Join the Flyte Community!

Checkpointing Error in Flytekit 1.13

Summary

The user hit a "Checkpointing not available" error after upgrading from flytekit 1.11 to 1.13, with the Flyte backend also at 1.13. They verified that their checkpointing usage follows the documentation. The error is raised in flytekit's context manager when the task accesses the checkpoint property, and surfaces as a NotImplementedError. With flytekit 1.11 the failed map task retried and then succeeded; with 1.13 the ArrayNode failed completely without retrying. The user asked whether 1.13 with ArrayNode requires metadata=TaskMetadata(cache=True, cache_version="0.1", retries=1). A maintainer confirmed that retries for a map task are set through TaskMetadata and later reported that flytekit 1.13.13 fixes the checkpointing problem.

Status
resolved
Tags
  • Error
  • Checkpointing
  • flyte
  • Error Reporting
  • Flytekit
  • Support Need
  • Developer
  • Error Report
  • Question
  • 1.13
  • 1.11
  • Bug Report
Source
#ask-the-community

    ytong

    11/6/2024

    as long as you’re not relying on prev_exists to always be accurate yes

    rupsha

    11/6/2024

    that’s all that’s needed right?

    rupsha

    11/6/2024

    I’m updating the version

    ytong

    11/6/2024

    1.13.13 fixes the cp

    ytong

    11/6/2024

    update please

    rupsha

    11/6/2024

    so do I need to just update.. or add the code you mentioned in the wf?

    ytong

    11/6/2024

    i was able to get the code a few messages up (https://flyte-org.slack.com/archives/CP2HDHKE1/p1730766374955659?thread_ts=1730493448.135719&cid=CP2HDHKE1) to run. (there’s an improvement i think we can make to the SyncCheckpoint class - prev_exists doesn’t seem to actually check if the folder exists, but that’s a separate issue we can address later)
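Since `prev_exists` may not reflect whether a checkpoint was actually written, one way to avoid depending on it is to decide from the result of `read()` alone. The sketch below is illustrative only: `ReadableCheckpoint`, `resume_index`, and `FakeCheckpoint` are hypothetical names, and it assumes `read()` returns `None` (or empty bytes) when no earlier checkpoint exists, which is how the documented intratask checkpoint example treats it.

```python
from typing import Optional, Protocol


class ReadableCheckpoint(Protocol):
    """Assumed minimal interface: read() returns bytes, or None if nothing was saved."""

    def read(self) -> Optional[bytes]: ...


def resume_index(cp: ReadableCheckpoint) -> int:
    """Return the index to resume from, based only on what read() returns."""
    prev = cp.read()
    if not prev:
        return 0  # no usable checkpoint: start from the beginning
    return int(prev.decode())


class FakeCheckpoint:
    """Hypothetical test double standing in for a real checkpoint object."""

    def __init__(self, data: Optional[bytes] = None) -> None:
        self._data = data

    def read(self) -> Optional[bytes]:
        return self._data


print(resume_index(FakeCheckpoint()))       # prints: 0
print(resume_index(FakeCheckpoint(b"42")))  # prints: 42
```

Inside a task, the same helper could be passed the object from `flytekit.current_context().checkpoint`, under the assumption stated above about what `read()` returns.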

    rupsha

    11/6/2024

    I’ll give it a shot

    rupsha

    11/6/2024

    thanks!

    ytong

    11/6/2024

    could you bump to 1.13.13 also please <@U03HQE6THNV>?

    rupsha

    11/5/2024

    Thanks.. I’ll give this a try

    ytong

    11/5/2024

    which we will have out tomorrow. still adding tests to it.

    ytong

    11/5/2024

    this will work as expected, the first two retries fail in the 3rd map task instance, and then succeeds. but the checkpoint code only works with that patch.

    ytong

    11/5/2024

    This is the test code we’ve been running

    ytong

    11/5/2024

    cc <@U0265RTUJ5B> who was also looking through this. I think it should be fine. The only thing is that you’ll need to raise a FlyteRecoverableException
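To make the retry-plus-checkpoint behavior discussed here concrete, the following is a minimal plain-Python sketch. It is not the maintainers' test code and it does not call flytekit: `InMemoryCheckpoint`, `RecoverableError`, `process`, and `run_with_retries` are hypothetical stand-ins for the checkpoint object, `FlyteRecoverableException`, the task body, and the platform's retry loop. It only illustrates the idea that a recoverable failure triggers another attempt and that the next attempt resumes from the last saved index.

```python
from typing import Callable, Optional, Set, Tuple


class RecoverableError(Exception):
    """Stand-in for a retryable failure."""


class InMemoryCheckpoint:
    """Hypothetical stand-in for a checkpoint with read()/write()."""

    def __init__(self) -> None:
        self._data: Optional[bytes] = None

    def read(self) -> Optional[bytes]:
        return self._data

    def write(self, data: bytes) -> None:
        self._data = data


def process(cp: InMemoryCheckpoint, n_items: int, fail_at: Set[int]) -> int:
    """Process items, resuming from the last checkpointed index."""
    prev = cp.read()
    start = int(prev.decode()) if prev else 0
    for index in range(start, n_items):
        if index in fail_at:
            fail_at.discard(index)  # fail only once at each index
            raise RecoverableError(f"simulated failure at item {index}")
        cp.write(str(index + 1).encode())  # record progress after each item
    return n_items


def run_with_retries(task: Callable[[], int], retries: int) -> Tuple[int, int]:
    """Call task up to retries + 1 times; return (result, attempts_used)."""
    for attempt in range(1, retries + 2):
        try:
            return task(), attempt
        except RecoverableError:
            if attempt == retries + 1:
                raise  # retries exhausted: surface the failure
    raise AssertionError("unreachable")


cp = InMemoryCheckpoint()
fail_at = {3, 7}  # two simulated failures, so the third attempt succeeds
result, attempts = run_with_retries(lambda: process(cp, 10, fail_at), retries=2)
print(result, attempts)  # prints: 10 3
```

In a real task the failure would be raised as `FlyteRecoverableException` and the retry count would come from the task metadata, as described in the thread.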

    rupsha

    11/4/2024

    1.13

    ytong

    11/4/2024

    can you tell me what backend version you’re running please?

    ytong

    11/4/2024

    still looking.

    rupsha

    11/4/2024

    any way to debug why retries aren’t working?

    ytong

    11/4/2024

    but that shouldn’t affect retries

    ytong

    11/4/2024

    rupsha

    11/4/2024

    <@UNR3C6Y4T> ^

    rupsha

    11/4/2024

    Verified once again by downgrading flytekit that this is indeed the pattern

    1.11 Map task fails first time due to “Checkpointing not available”… then retries and succeeds

    1.13 Array node fails first time due to “Checkpointing not available”.. and then DOES NOT retry

    rupsha

    11/4/2024

    even with the task metadata

    rupsha

    11/4/2024

    AND it isn’t retrying the failed array node :neutral_face:

    rupsha

    11/4/2024

    Is there a problem with using CP from map task/Array node ?

    rupsha

    11/4/2024

    this is from the map task

    rupsha

    11/4/2024

    Fails on the first line

    rupsha

    11/4/2024

    ```
    encoded_name = cp.read()
    ```
    
    ytong

    11/4/2024

    mind sharing your cp code btw?

    ytong

    11/4/2024

    let me make a repro

    rupsha

    11/4/2024

    and same error about missing CP implementation

    rupsha

    11/4/2024

    new code:

    ```
            run_mapmatch,
            concurrency=MAP_MATCHING_DEFAULT_CONCURRENCY,
            metadata=TaskMetadata(
                cache=True, cache_version="0.0.1", retries=NUM_RETRIES, interruptible=False
            ),
        )(partitioned_input=partitioned_inputs).with_overrides(
            limits=Resources(mem="1Gi"),
        )
    ```
    
    rupsha

    11/4/2024

    Tried that.. still no retries

    ytong

    11/4/2024

    hey <@U03HQE6THNV> yeah, to control retries for the map task itself, you will need to set the task metadata field map_task(t1, metadata=TaskMetadata(retries=N))(a=1)

    rupsha

    11/4/2024

    Looks like for 1.13 / ArrayNode I need to use this instead?

    metadata=TaskMetadata(cache=True, cache_version="0.1", retries=1)

    rupsha

    11/4/2024

    How it’s invoked:

    ```
            run_function, concurrency=DEFAULT_CONCURRENCY
        )(partitioned_input=partitioned_inputs).with_overrides(
            limits=Resources(mem="1Gi"),
            retries=NUM_RETRIES,
        )
    ```
    
    rupsha

    11/4/2024

    This is with 1.11

    rupsha

    11/4/2024

    This is with 1.13

    rupsha

    11/4/2024

    This is with 1.11

    rupsha

    11/4/2024

    <@UNR3C6Y4T> <@UNZB4NW3S> ^

    rupsha

    11/4/2024

    I have cp in another workflow as well which works just fine even with 1.13… so not sure what’s causing the flakiness… and not retrying the failed map task is definitely concerning

    rupsha

    11/4/2024

    With 1.11 the map task actually retried.. with 1.13 the map task (array node) did not retry and just failed completely

    rupsha

    11/4/2024

    This is quite bizarre… The task has retries.. so first try fails with the error

    ```
            return wrapped(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
          File "/root/flyte/workflows/spark/mapmatcher.py", line 200, in run_mapmatch
            run_spark_app(app)
          File "/root/flyte/workflows/spark/utils.py", line 68, in run_spark_app
            cp = flytekit.current_context().checkpoint
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          File "/opt/venv/lib/python3.11/site-packages/flytekit/core/context_manager.py", line 259, in checkpoint
            raise NotImplementedError("Checkpointing is not available, please check the version of the platform.")

    Message:

        NotImplementedError: Checkpointing is not available, please check the version of the platform.
    ```
    Then the next attempt is successful..
    
    rupsha

    11/2/2024

    let me run it again and see if it happens again

    rupsha

    11/2/2024

    went from 1.11 to 1.13

    kumare

    11/2/2024

    It might be some config on her end

    kumare

    11/2/2024

    <@UNR3C6Y4T> can we just run a test on our platform that’s what I think

    ytong

    11/2/2024

    checkpointing has not changed in a while.

    ytong

    11/2/2024

    what version were you on before?

    kumare

    11/2/2024

    If it’s an issue we will have to file a bug

    kumare

    11/2/2024

    Cc <@UNR3C6Y4T> can you check please

    kumare

    11/2/2024

    This has not changed

    kumare

    11/2/2024

    Sorry about that

    rupsha

    11/1/2024

    This is an existing workflow that has now started failing :disappointed:

    kumare

    11/1/2024

    that does not make sense, let us reproduce

    rupsha

    11/1/2024

    Hi team, I’m running into a “Checkpointing not available” error after recently upgrading flytekit. I’m on flytekit==1.13 and flyte backend 1.13.

    Verified from the docs here that the usage is correct: https://docs.flyte.org/en/latest/user_guide/advanced_composition/intratask_checkpoints.html

    ```
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          File "/opt/venv/lib/python3.11/site-packages/flytekit/core/context_manager.py", line 262, in checkpoint
            raise NotImplementedError("Checkpointing is not available, please check the version of the platform.")
    ```