Summary
The user is experiencing an issue where torch.cuda.is_available() returns False inside a Flyte task but True when using kubectl exec in the same pod. They are using a custom Docker image based on python:3.10-slim with flytekit-1.13.5 and are investigating potential causes such as differing torch versions and CUDA mismatches. Despite confirming that the Python binary and the torch install location are the same in both scenarios, only switching to a pytorch-cuda base image allowed Flyte to recognize CUDA. The user finds it odd that they could train a model on a GPU with the same image outside of Flyte; the discussion suggests a difference in the library loading path, and a coworker proposed checking the user executing the task and the related permissions.
miha.garafolj249
<@U0662K01EUQ> sure, here is a somewhat anonymized version of it
```
FROM python:3.10-slim
# --- Set up google cloud, python etc.
COPY --from=inventory /inventory/scripts /inventory/scripts
RUN bash /inventory/scripts/base_python.sh && \
bash /inventory/scripts/google_cloud_sdk.sh && \
bash /inventory/scripts/kubectl.sh
# Install lib dependencies
ADD pyproject.toml pyproject.toml
ADD poetry.lock poetry.lock
RUN . ~/.profile && \
poetry install --no-interaction --no-ansi --no-root && \
poetry -n cache clear --all .
WORKDIR /workspace
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /workspace
ADD module_code/ module_code
# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
```
Note that `poetry.lock` contains this, which installs the CUDA dependencies
```
[package.dependencies]
nvidia-cublas-cu11 = {version = "11.10.3.66", markers = "platform_system == \"Linux\""}
nvidia-cuda-nvrtc-cu11 = {version = "11.7.99", markers = "platform_system == \"Linux\""}
nvidia-cuda-runtime-cu11 = {version = "11.7.99", markers = "platform_system == \"Linux\""}
nvidia-cudnn-cu11 = {version = "8.5.0.96", markers = "platform_system == \"Linux\""}
```
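For reference, a quick way to confirm inside the container which of these nvidia-* wheels actually got installed is a sketch like the one below; it only uses the standard library and torch, and is not something posted in the original thread.
```
# List the nvidia-* wheels that poetry installed alongside torch.
from importlib import metadata

import torch

for dist in metadata.distributions():
    name = (dist.metadata["Name"] or "").lower()
    if name.startswith("nvidia-"):
        print(name, dist.version)

print("torch wheel:", torch.__version__)
```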
kumare
Hmm that’s interesting
miha.garafolj249
my coworker proposed to check the user executing the task and the permissions related to it, could also be that
miha.garafolj249
right, was going after the same thing
kumare
Ohh I believe you I am just saying how this can manifest
miha.garafolj249
i swear, it was not
kumare
It has to be a difference in the library loading path
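A minimal sketch of how one could compare the loader view between the two contexts (run once inside the Flyte task and once via kubectl exec, then diff the output); the environment variables checked are the standard loader/NVIDIA ones, not anything confirmed for this setup.
```
# Compare what the dynamic loader and CUDA runtime would see in each context.
import ctypes.util
import os

for var in ("LD_LIBRARY_PATH", "CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
    print(var, "=", os.environ.get(var))

# find_library walks the loader's search path; a difference here would back
# the "library loading path" theory.
for lib in ("cuda", "cudart", "cudnn"):
    print(lib, "->", ctypes.util.find_library(lib))
```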
miha.garafolj249
well anyway i've used a pytorch-cuda base image instead, and it got flyte to pick up CUDA
still i find it strange that i was able to use a GPU to train a model in the same image, just outside flyte
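For context, the task-side GPU request in flytekit looks roughly like this; the task name and resource values are placeholders, not the poster's actual configuration.
```
# Rough sketch of a flytekit task requesting a GPU; values are illustrative.
import torch
from flytekit import Resources, task


@task(requests=Resources(gpu="1", mem="8Gi"), limits=Resources(gpu="1", mem="8Gi"))
def check_cuda() -> bool:
    # With the pytorch-cuda base image this reportedly returns True in-task.
    return torch.cuda.is_available()
```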
miha.garafolj249
I checked the python binary and torch location and it's the same in both cases
```
import sys
print("python:", sys.executable)
```
kumare
Different torch versions and mismatch with cuda?
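One way to test that theory is to compare the CUDA version the torch wheel was built against with the driver the node exposes; a rough sketch, assuming nvidia-smi is mounted into the container by the device plugin.
```
# Check for a torch/CUDA mismatch: wheel's CUDA version vs. the node driver.
import subprocess

import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)  # None for CPU-only wheels

try:
    print(subprocess.check_output(["nvidia-smi"], text=True))
except (FileNotFoundError, subprocess.CalledProcessError) as exc:
    print("nvidia-smi not usable here:", exc)
```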
miha.garafolj249
Losing my nerves over this for the whole day, maybe someone here has an idea:
• I can't get the GPU available within the flyte task, i.e. torch.cuda.is_available() returns False
• However, if I kubectl exec /bin/bash into the same pod, torch.cuda.is_available() returns True
What could be possible reasons why that is the case?
I am using a custom docker image for registering tasks/workflows; the image is based on python:3.10-slim and has flytekit-1.13.5 installed into it
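A minimal version of the check described above, to run both inside the Flyte task and via kubectl exec in the same pod; torch.cuda.init() is only there to force initialization so the underlying error message surfaces.
```
# Run in both contexts and compare the output.
import sys

import torch

print("python:", sys.executable)
print("torch:", torch.__file__, torch.__version__)
print("cuda available:", torch.cuda.is_available())

try:
    # Forces CUDA initialization so the real failure reason is raised.
    torch.cuda.init()
except Exception as exc:
    print("cuda init failed:", exc)
```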