Summary
The user hits a "Checkpointing not available" error after upgrading to flytekit 1.13, with the Flyte backend at the same version. They have verified that their checkpointing usage follows the documentation. The error is raised in the context manager when the task accesses the checkpoint property, which raises a NotImplementedError. They also report that in version 1.11 the map task retried and succeeded, whereas in version 1.13 the ArrayNode failed completely without retrying. The user suggests that for version 1.13 with ArrayNode they may need to pass metadata=TaskMetadata(cache=True, cache_version="0.1", retries=1).
ytong
as long as you’re not relying on prev_exists to always be accurate yes
rupsha
that’s all that’s needed right?
rupsha
I’m updating the version
ytong
1.13.13 fixes the cp
ytong
update please
rupsha
so do I need to just update.. or add the code you mentioned in the wf?
ytong
i was able to get the code a few messages up (https://flyte-org.slack.com/archives/CP2HDHKE1/p1730766374955659?thread_ts=1730493448.135719&cid=CP2HDHKE1) to run. (there’s an improvement i think we can make to the SyncCheckpoint class - prev_exists doesn’t seem to actually check if the folder exists, but that’s a separate issue we can address later)
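Since `prev_exists` may not be accurate, one defensive pattern is to decide whether to resume from the value returned by `read()` instead. The sketch below is plain Python, not flytekit code: `InMemoryCheckpoint` is a hypothetical stand-in that only mirrors the `read`/`write`/`prev_exists` method names, and it assumes `read()` returns `None` when nothing was saved, which should be checked against the flytekit version in use.

```python
from typing import Optional


class InMemoryCheckpoint:
    """Hypothetical stand-in for a checkpoint object (not the flytekit class)."""

    def __init__(self, previous: Optional[bytes] = None):
        self._previous = previous
        self.saved: Optional[bytes] = None

    def prev_exists(self) -> bool:
        # Deliberately optimistic, to mimic the inaccuracy mentioned in the thread.
        return True

    def read(self) -> Optional[bytes]:
        # Assumption: None means no previous checkpoint was found.
        return self._previous

    def write(self, data: bytes) -> None:
        self.saved = data


def load_progress(cp) -> int:
    """Return the index to resume from, trusting read() rather than prev_exists()."""
    previous = cp.read()
    if previous is None:
        return 0
    return int(previous.decode("utf-8"))


def process(items, cp) -> int:
    """Process items from the last saved index and return where we started."""
    start = load_progress(cp)
    for index in range(start, len(items)):
        # ... real work on items[index] would go here ...
        cp.write(str(index + 1).encode("utf-8"))
    return start


fresh = InMemoryCheckpoint()
resumed = InMemoryCheckpoint(previous=b"2")
```

With this shape, a first attempt starts at index 0 even though `prev_exists()` claims a checkpoint is there, and a retry with saved progress picks up where the previous attempt stopped.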
rupsha
I’ll give it a shot
rupsha
thanks!
ytong
could you bump to 13.13 also please <@U03HQE6THNV>?
rupsha
Thanks.. I’ll give this a try
ytong
which we will have out tomorrow. still adding tests to it.
ytong
this will work as expected, the first two retries fail in the 3rd map task instance, and then succeeds. but the checkpoint code only works with that patch.
ytong
This is the test code we’ve been running
ytong
cc <@U0265RTUJ5B> who was also looking through this. I think it should be fine. The only thing is that you’ll need to raise a FlyteRecoverableException
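The retry behaviour described here (a failure is retried only when it is flagged as recoverable) can be modelled in plain Python. This is a simplified sketch, not flytekit code: `RecoverableError` is a hypothetical stand-in for `FlyteRecoverableException`, and `run_with_retries` is an illustrative helper that only approximates what the platform does.

```python
class RecoverableError(Exception):
    """Hypothetical stand-in for flytekit's FlyteRecoverableException."""


def run_with_retries(fn, retries):
    """Call fn(attempt) up to retries + 1 times.

    Only RecoverableError triggers another attempt; any other exception
    propagates immediately, which models a non-recoverable failure.
    """
    for attempt in range(retries + 1):
        try:
            return fn(attempt)
        except RecoverableError:
            if attempt == retries:
                raise  # retries exhausted


def flaky(attempt):
    # Fails on the first two attempts, then succeeds, like the test run
    # described in the thread.
    if attempt < 2:
        raise RecoverableError(f"attempt {attempt} failed")
    return f"succeeded on attempt {attempt}"


result = run_with_retries(flaky, retries=2)
```

In a real task the equivalent would presumably be raising `FlyteRecoverableException` from inside the task body, with the retry count set through the task metadata.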
rupsha
1.13
ytong
can you tell me what backend version you’re running please?
ytong
still looking.
rupsha
any way to debug why retries aren’t working?
ytong
but that shouldn’t affect retries
ytong
there’s an issue yeah: https://github.com/flyteorg/flytekit/pull/2898/files
rupsha
<@UNR3C6Y4T> ^
rupsha
Verified once again by downgrading flytekit that this is indeed the pattern
1.11 Map task fails first time due to “Checkpointing not available”… then retries and succeeds
1.13 Array node fails first time due to “Checkpointing not available”.. and then DOES NOT retry
rupsha
even with the task metadata
rupsha
AND it isn’t retrying the failed array node :neutral_face:
rupsha
Is there a problem with using CP from map task/Array node ?
rupsha
this is from the map task
rupsha
Fails on the first line
rupsha
```
encoded_name = cp.read()
```
ytong
mind sharing your cp code btw?
ytong
let me make a repro
rupsha
and same error about missing CP implementation
rupsha
new code:
```
    run_mapmatch,
    concurrency=MAP_MATCHING_DEFAULT_CONCURRENCY,
    metadata=TaskMetadata(
        cache=True, cache_version="0.0.1", retries=NUM_RETRIES, interruptible=False
    ),
)(partitioned_input=partitioned_inputs).with_overrides(
    limits=Resources(mem="1Gi"),
)
```
rupsha
Tried that.. still no retries
ytong
hey <@U03HQE6THNV> yeah, to control retries for the map task itself, you will need to set the task metadata field
map_task(t1, metadata=TaskMetadata(retries=N))(a=1)
rupsha
Looks like for 1.13 / ArrayNode I need to use this instead?
metadata=TaskMetadata(cache=True, cache_version="0.1", retries=1)
rupsha
How it’s invoked:
```
    run_function, concurrency=DEFAULT_CONCURRENCY
)(partitioned_input=partitioned_inputs).with_overrides(
    limits=Resources(mem="1Gi"),
    retries=NUM_RETRIES,
)
```
rupsha
This is with 1.11
rupsha
This is with 1.13
rupsha
This is with 1.11
rupsha
<@UNR3C6Y4T> <@UNZB4NW3S> ^
rupsha
I have cp in another workflow as well which works just fine even with 1.13… so not sure what’s causing the flakiness… and not retrying the failed map task is definitely concerning
rupsha
With 1.11 the map task actually retried.. with 1.13 the map task (array node) did not retry and just failed completely
rupsha
This is quite bizarre… The task has retries.. so first try fails with the error
```
    return wrapped(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/flyte/workflows/spark/mapmatcher.py", line 200, in run_mapmatch
    run_spark_app(app)
  File "/root/flyte/workflows/spark/utils.py", line 68, in run_spark_app
    cp = flytekit.current_context().checkpoint
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/flytekit/core/context_manager.py", line 259, in checkpoint
    raise NotImplementedError("Checkpointing is not available, please check the version of the platform.")
Message:
    NotImplementedError: Checkpointing is not available, please check the version of the platform.
```
Then the next attempt is successful..
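One way to keep an attempt from dying on this error is to treat a missing checkpoint as "start from scratch". The sketch below is plain Python with a hypothetical `NoCheckpointContext` stand-in that raises the same `NotImplementedError` as in the traceback; `get_checkpoint_or_none` is an illustrative helper, not a flytekit function.

```python
class NoCheckpointContext:
    """Hypothetical stand-in for a context whose checkpoint is unavailable."""

    @property
    def checkpoint(self):
        raise NotImplementedError(
            "Checkpointing is not available, please check the version of the platform."
        )


class WithCheckpointContext:
    """Hypothetical stand-in for a context that does provide a checkpoint."""

    def __init__(self, cp):
        self._cp = cp

    @property
    def checkpoint(self):
        return self._cp


def get_checkpoint_or_none(ctx):
    """Return ctx.checkpoint, or None when the platform does not provide one."""
    try:
        return ctx.checkpoint
    except NotImplementedError:
        return None


missing = get_checkpoint_or_none(NoCheckpointContext())
present = get_checkpoint_or_none(WithCheckpointContext(cp="checkpoint-object"))
```

In a task this would presumably be called with `flytekit.current_context()`. The trade-off is that the attempt silently runs without checkpointing, so whether that beats failing and retrying depends on how expensive the task is to restart.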
rupsha
let me run it again and see if it happens again
rupsha
went from 1.11 to 1.13
kumare
It might be some config on her end
kumare
<@UNR3C6Y4T> can we just run a test on our platform that’s what I think
ytong
checkpointing has not changed in a while.
ytong
what version were you on before?
kumare
If it’s an issue we will have to file a bug
kumare
Cc <@UNR3C6Y4T> can you check please
kumare
This has not changed
kumare
Sorry about that
rupsha
This is an existing workflow that has now started failing :disappointed:
kumare
that does not make sense, let us reproduce
rupsha
Hi team, I’m running into a “Checkpointing not available” error after recently upgrading flytekit. I’m on flytekit==1.13 and flyte backend 1.13.
Verified from the docs (https://docs.flyte.org/en/latest/user_guide/advanced_composition/intratask_checkpoints.html) that the usage is correct
```
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/flytekit/core/context_manager.py", line 262, in checkpoint
    raise NotImplementedError("Checkpointing is not available, please check the version of the platform.")
```