Summary
The user is running the PyTorch Lightning MNIST example on their homelab computer but is facing issues with GPU usage. They received a warning about using cuda="12.1.0" and replaced it with conda_channels=["nvidia"]. Although they can execute the workflow on their cluster, the GPU shows 0% usage in nvidia-smi. The user is questioning the use of Elastic with a single GPU and is looking for ways to check the job's current status. They are advised that Elastic can be used with one GPU and that issues often arise from driver and library version mismatches. They are also prompted to check whether the built image has CUDA drivers matching their PyTorch version and to test with torch.cuda.is_available(). Additionally, they can build the image locally using pyflyte build <script>.py <workflow/task_name>.
niels
also, what does your Elastic task config look like?
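(For reference, a minimal single-GPU Elastic task config might look roughly like the sketch below. This assumes the flytekitplugins-kfpytorch plugin from the Lightning MNIST docs; the task name, resource requests, and body are illustrative placeholders, not the user's actual code.)

```python
from flytekit import Resources, task
from flytekitplugins.kfpytorch import Elastic


# Single-node, single-process Elastic task; the GPU is requested via Resources
# so the pod is scheduled onto a GPU node.
@task(
    task_config=Elastic(nnodes=1, nproc_per_node=1),
    requests=Resources(gpu="1", mem="8Gi"),
)
def train_autoencoder() -> None:
    # placeholder for the Lightning training loop from the example
    ...
```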
niels
It’s important to check if the built image has the correct CUDA drivers that match your PyTorch version: test with torch.cuda.is_available().
You can run pyflyte build <script>.py <workflow/task_name> to build the image locally.
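(A small diagnostic task along these lines can verify this inside the built image when run on the cluster; the task name and resource request are illustrative, not from the example.)

```python
import torch
from flytekit import Resources, task


# Runs inside the built container image, so it reports what that image sees,
# not what the host Python environment sees.
@task(requests=Resources(gpu="1"))
def check_gpu() -> str:
    if not torch.cuda.is_available():
        return "CUDA not available inside the image"
    return f"CUDA OK, device: {torch.cuda.get_device_name(0)}"
```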
niels
<@U07S5N6R05D> what does your ImageSpec look like?
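(For comparison, an ImageSpec that pulls a CUDA-enabled PyTorch build through conda, as suggested by the warning the user saw, might look roughly like this. The image name, registry, and package pins are illustrative, not the exact ones from the docs.)

```python
from flytekit import ImageSpec

# Illustrative ImageSpec for the Lightning MNIST example, installing a CUDA-enabled
# PyTorch from conda channels instead of the deprecated cuda="12.1.0" argument.
custom_image = ImageSpec(
    name="pytorch-lightning-mnist",  # hypothetical name
    registry="localhost:30000",      # replace with your registry
    packages=["flytekitplugins-kfpytorch", "lightning", "torchvision"],
    conda_channels=["nvidia", "pytorch"],
    conda_packages=["pytorch", "pytorch-cuda=12.1"],
)
```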
kumare
You can use Elastic with one GPU. Usually the problem is a mismatch between the driver and the library version.
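(One way to spot such a mismatch is to print the CUDA version PyTorch was built against and compare it with what the node's driver reports via nvidia-smi; a quick sketch to run inside the task container:)

```python
import torch

# torch.version.cuda is the CUDA toolkit the PyTorch build targets; the node's
# NVIDIA driver must support at least that version (see "CUDA Version" in nvidia-smi).
print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```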
kumare
Cc <@U0635LYB5PD>
kumare
Cc <@U01DYLVUNJE>
ch.braendli
I am running the PyTorch Lightning MNIST example (https://docs.flyte.org/en/latest/flytesnacks/examples/kfpytorch_plugin/pytorch_lightning_mnist_autoencoder.html) on my homelab computer. First it was complaining that I should not use cuda="12.1.0", so I replaced it with conda_channels=["nvidia"], and now I can start the execution of the workflow on my cluster, but it does not seem to use the GPU at all: nvidia-smi shows 0% volatile usage. I expected the fan to go crazy. Should I not use Elastic with only one GPU? How can I check what the job is doing right now?