
PyTorch Lightning GPU Usage Issues

Summary

The user is running the PyTorch Lightning MNIST example on their homelab computer, but the GPU is not being used. They first received a warning about using cuda="12.1.0" and replaced it with conda_channels=["nvidia"]. The workflow now executes on their cluster, yet nvidia-smi reports 0% GPU utilization. The user asks whether Elastic should be used with a single GPU and how to check what the job is currently doing. They are advised that Elastic works with one GPU and that such issues usually stem from a mismatch between driver and library versions. They are also prompted to verify that the built image has CUDA libraries matching their PyTorch version, to test with torch.cuda.is_available(), and to build the image locally with pyflyte build <script>.py <workflow/task_name>.

Status
resolved
Tags
  • Homelab
  • PyTorch Lightning
  • Workflow
  • flyte
  • PyTorch
  • GPU Issues
  • GPU Utilization
  • nvidia-smi
  • Question
  • Developer Help
  • CUDA
Source
#ask-the-community

    niels

    10/23/2024

    also, what does your Elastic task config look like?


    niels

    10/23/2024

It’s important to check that the built image has the correct CUDA libraries matching your PyTorch version: test with torch.cuda.is_available()

You can do pyflyte build <script>.py <workflow/task_name> to build the image locally
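As a minimal sketch of the check niels describes (the helper name and report keys are my own, not from the thread), a diagnostic like this can be run inside the built image to tell a missing torch install apart from a CUDA runtime/driver mismatch:

```python
import importlib.util


def cuda_sanity_report() -> dict:
    """Probe the runtime image: is torch importable, and can it see a CUDA device?"""
    report = {"torch_installed": False, "cuda_available": False, "device_count": 0}
    if importlib.util.find_spec("torch") is None:
        # torch is missing entirely; the image build itself is broken
        return report
    import torch

    report["torch_installed"] = True
    # False here, with torch installed, usually means the image's CUDA
    # runtime does not match the host's NVIDIA driver
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_count"] = torch.cuda.device_count()
    return report
```

Printing this report from a task (or from a shell in the container via python -c) narrows down whether the problem is the image contents or the driver on the node.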


    niels

    10/23/2024

    <@U07S5N6R05D> what does your ImageSpec look like?
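For reference, an ImageSpec along the lines the thread is discussing might look like the sketch below. The image name, package list, and CUDA version are illustrative assumptions, not taken from the thread; the key point is that pulling pytorch from conda channels bundles a CUDA runtime that must be compatible with the host's NVIDIA driver:

```python
from flytekit import ImageSpec

# Illustrative sketch only: names and versions are assumptions.
pytorch_image = ImageSpec(
    name="pytorch-lightning-mnist",
    packages=["torchvision", "lightning"],
    conda_channels=["nvidia", "pytorch"],
    # pytorch-cuda pins the CUDA runtime inside the image; it must be
    # compatible with the driver version nvidia-smi reports on the node
    conda_packages=["pytorch", "pytorch-cuda=12.1"],
)
```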


    kumare

    10/22/2024

You can use Elastic with one GPU. Usually the problem is a mismatch between the driver and the library versions.
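A single-GPU Elastic task along those lines could be declared as in the sketch below (resource amounts are illustrative assumptions, not from the thread). Elastic with nnodes=1 and nproc_per_node=1 is valid; the task still needs an explicit GPU resource request so the pod is scheduled onto a GPU node at all:

```python
from flytekit import Resources, task
from flytekitplugins.kfpytorch import Elastic


# Sketch only: a single node running a single worker process,
# with one GPU requested for the pod.
@task(
    task_config=Elastic(nnodes=1, nproc_per_node=1),
    requests=Resources(gpu="1", mem="8Gi"),
)
def train() -> None:
    ...
```

If the GPU request is missing, the container may land on a CPU-only node and nvidia-smi on the GPU host will show 0% utilization even though the workflow runs.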


    kumare

    10/22/2024

    Cc <@U0635LYB5PD>


    kumare

    10/22/2024

    Cc <@U01DYLVUNJE>


    ch.braendli

    10/22/2024

    I am running the PyTorch Lightning MNIST example (https://docs.flyte.org/en/latest/flytesnacks/examples/kfpytorch_plugin/pytorch_lightning_mnist_autoencoder.html) on my homelab computer. First it complained that I should not use cuda="12.1.0", so I replaced it with conda_channels=["nvidia"]. Now I can start the execution of the workflow on my cluster, but it does not seem to use the GPU at all: nvidia-smi shows 0% volatile utilization. I expected the fan to go crazy. Should I not use Elastic with only one GPU? How can I check what the job is doing right now?