
CUDA Availability Issue in Flyte Task

Summary

The user is experiencing an issue where torch.cuda.is_available() returns False inside a Flyte task but True when running the same check via kubectl exec in the same pod. They are using a custom Docker image based on python:3.10-slim with flytekit-1.13.5 and have investigated possible causes such as mismatched torch and CUDA versions. The Python binary and torch location were confirmed to be identical in both scenarios, yet switching to a pytorch-cuda base image allowed Flyte to pick up CUDA. The user finds it odd that they could train a model on a GPU with the same image outside of Flyte, and suspects a difference in the library loading path. A coworker suggested checking which user executes the task and the related permissions.

Status: open
Tags / Source: #ask-the-community
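
To reproduce the check described in the summary inside a Flyte task, a minimal sketch looks like the following (the task and workflow names and the GPU resource request are illustrative assumptions, not the user's actual code):

```python
# Minimal sketch: a Flyte task that requests a GPU and reports whether torch sees it.
# Task/workflow names and the resource request are illustrative assumptions.
import sys

import torch
from flytekit import Resources, task, workflow


@task(requests=Resources(gpu="1"), limits=Resources(gpu="1"))
def cuda_probe() -> bool:
    print("python:", sys.executable)  # which interpreter runs the task
    print("torch:", torch.__file__)   # where torch was imported from
    available = torch.cuda.is_available()
    print("cuda available:", available)
    return available


@workflow
def cuda_probe_wf() -> bool:
    return cuda_probe()
```

Running this workflow and then running the same prints via kubectl exec in the task pod reproduces the comparison described in the thread.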

      miha.garafolj249

      9/26/2024

      <@U0662K01EUQ> sure, here is a somewhat anonymized version of it

      ```
      FROM python:3.10-slim

      # --- Set up google cloud, python etc.
      COPY --from=inventory /inventory/scripts /inventory/scripts
      RUN bash /inventory/scripts/base_python.sh && \
          bash /inventory/scripts/google_cloud_sdk.sh && \
          bash /inventory/scripts/kubectl.sh

      # Install lib dependencies
      ADD pyproject.toml pyproject.toml
      ADD poetry.lock poetry.lock
      RUN ~/.profile && \
          poetry install --no-interaction --no-ansi --no-root && \
          poetry -n cache clear --all .

      WORKDIR /workspace
      ENV VENV /opt/venv
      ENV LANG C.UTF-8
      ENV LC_ALL C.UTF-8
      ENV PYTHONPATH /workspace

      ADD module_code/ module_code

      # This tag is supplied by the build script and will be used to determine the version
      # when registering tasks, workflows, and launch plans
      ARG tag
      ENV FLYTE_INTERNAL_IMAGE $tag
      ```
      Note that `poetry.lock` contains this, which installs CUDA dependencies:
      ```
      [package.dependencies]
      nvidia-cublas-cu11 = {version = "11.10.3.66", markers = "platform_system == \"Linux\""}
      nvidia-cuda-nvrtc-cu11 = {version = "11.7.99", markers = "platform_system == \"Linux\""}
      nvidia-cuda-runtime-cu11 = {version = "11.7.99", markers = "platform_system == \"Linux\""}
      nvidia-cudnn-cu11 = {version = "8.5.0.96", markers = "platform_system == \"Linux\""}
      ```
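
One way to check whether torch actually resolves its CUDA libraries from these pip-installed nvidia-* wheels (rather than from somewhere else in the container) is to inspect the shared libraries the process has mapped after torch tries to initialize CUDA. A minimal sketch, assuming a Linux container where /proc is readable; run it both inside the Flyte task and via kubectl exec to compare:

```python
# Sketch: list the CUDA-related shared libraries the current process has mapped.
import torch

torch.cuda.is_available()  # force torch to attempt loading the CUDA driver/runtime

keywords = ("libcuda", "libcudart", "libcublas", "libcudnn", "libnvrtc")
paths = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        parts = line.split()
        if len(parts) >= 6 and any(k in parts[-1] for k in keywords):
            paths.add(parts[-1])

for p in sorted(paths):
    print(p)
```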
      

      kumare

      9/26/2024

      Hmm that’s interesting


      miha.garafolj249

      9/26/2024

      my coworker proposed checking the user executing the task and the permissions related to it, that could also be it


      miha.garafolj249

      9/26/2024

      right, was going after the same thing


      kumare

      9/26/2024

      Ohh I believe you, I am just saying how this can manifest


      miha.garafolj249

      9/26/2024

      i swear, it was not


      kumare

      9/26/2024

      It has to be a difference in the library loading path
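
A quick way to test the library-loading-path hypothesis is to dump the environment variables that influence CUDA library and device resolution from inside the Flyte task and diff them against the same output from a kubectl exec shell in the pod. A minimal sketch (the variable list is a reasonable guess, not exhaustive):

```python
# Sketch: print environment variables that commonly affect CUDA library/device resolution.
import os

for var in (
    "PATH",
    "LD_LIBRARY_PATH",
    "LD_PRELOAD",
    "CUDA_VISIBLE_DEVICES",
    "NVIDIA_VISIBLE_DEVICES",
    "NVIDIA_DRIVER_CAPABILITIES",
):
    print(f"{var}={os.environ.get(var)}")
```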


      miha.garafolj249

      9/26/2024

      well anyway I've used a pytorch-cuda base image instead, and it got flyte to pick up CUDA

      still I find it strange that I was able to use a GPU to train a model in the same image, just outside flyte


      miha.garafolj249

      9/26/2024

      I checked the python binary and torch location and it's the same in both cases

      ```
      import sys

      print("python:", sys.executable)
      ```
      

      kumare

      9/26/2024

      Different torch versions and a mismatch with CUDA?
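
To rule out the version mismatch kumare raises, one could print the torch build details in both environments and compare them. A minimal sketch:

```python
# Sketch: show which torch build is installed and which CUDA/cuDNN it was built for.
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)  # None indicates a CPU-only wheel
print("cuDNN:", torch.backends.cudnn.version())
print("cuda available:", torch.cuda.is_available())
```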


      miha.garafolj249

      9/26/2024

      Losing my nerves over this for the whole day, maybe someone here has an idea:
      • I can't get the GPU available within the flyte task, i.e. torch.cuda.is_available() returns False
      • However, if I kubectl exec /bin/bash into the same pod, torch.cuda.is_available() returns True
      What could be possible reasons why that is the case? I am using a custom docker image for registering tasks/workflows; the image is based on python:3.10-slim and has flytekit-1.13.5 installed into it.