Serving GPU Endpoint, can't find CUDA

kfab · ‎02-14-2024

Hi everyone !
I'm encountering an issue while trying to serve my model on a GPU endpoint.
My model uses deepspeed, which needs CUDA, and I got the following error:

"An error occurred while loading the model. CUDA_HOME does not exist, unable to compile CUDA op(s)."

Not having access to the endpoint through a terminal makes it hard to debug the issue.
On the personal compute that I used to register and test the model, CUDA is installed and the model works fine. CUDA is installed in /usr/local/cuda, as mentioned in the documentation.

But on the endpoint it seems that it is not the case.

I first tried setting the CUDA_HOME environment variable manually to '/usr/local/cuda', hoping it would work, but it didn't. I got the following error:

"[Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc"

Now I'm starting to wonder whether the endpoint computes have CUDA installed at all. It would be weird if they didn't, right?

I ran these commands from my model loading method to check whether it could be installed elsewhere, but they found nothing:

import os
print(os.popen("ls -l /usr/local/").read())
print(os.popen("ls -l /opt/").read())
print(os.popen("nvcc --version").read())
print(os.popen("which nvcc").read())

[86bb6k8gpl] ls: cannot access '/usr/local/cuda': No such file or directory
[86bb6k8gpl] /bin/sh: 1: nvcc: not found
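
For anyone hitting the same thing, here is a rough sketch of a more systematic probe that could be called from the model loading method instead of separate shell commands. The function name find_nvcc and the glob patterns are my own assumptions, not something Databricks documents; it only checks the PATH, a few common environment variables, and some conventional install directories.

```python
import glob
import os
import shutil


def find_nvcc(extra_patterns=("/usr/local/cuda*/bin/nvcc",
                              "/opt/*/bin/nvcc",
                              "/opt/*/*/bin/nvcc")):
    """Return a sorted list of nvcc binaries found in the probed locations.

    The default patterns are guesses at conventional install locations.
    """
    found = set()

    # 1. Anything already reachable through PATH.
    on_path = shutil.which("nvcc")
    if on_path:
        found.add(os.path.realpath(on_path))

    # 2. Roots advertised through common environment variables.
    for env_var in ("CUDA_HOME", "CUDA_PATH", "CONDA_PREFIX"):
        root = os.environ.get(env_var)
        if root:
            candidate = os.path.join(root, "bin", "nvcc")
            if os.path.isfile(candidate):
                found.add(os.path.realpath(candidate))

    # 3. Conventional directories, matched with glob patterns.
    for pattern in extra_patterns:
        for candidate in glob.glob(pattern):
            if os.path.isfile(candidate):
                found.add(os.path.realpath(candidate))

    return sorted(found)


print(find_nvcc() or "nvcc not found in any probed location")
```

An empty result here would suggest that the CUDA toolkit (the compiler, as opposed to the driver and runtime libraries) is simply not present in the serving container.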

I'm pretty new to Databricks, so I may be missing something obvious. Maybe it is installed in a custom location, but that is hard to find print by print.
Any help would be appreciated 😅

kfab · ‎02-20-2024

Hi @Retired_mod ,

thanks for your reply !

I managed to install CUDA via conda 👍
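
In case it helps others, below is a minimal sketch of how CUDA_HOME could then be pointed at the conda environment. It assumes the conda package places nvcc under the environment's bin directory, which is typical but depends on the package used. The helper name point_cuda_home_at_env is hypothetical, and falling back to sys.prefix when CONDA_PREFIX is unset is also an assumption.

```python
import os
import sys


def point_cuda_home_at_env(prefix=None):
    """Set CUDA_HOME to `prefix` if it contains bin/nvcc.

    Returns the prefix that was used, or None if no nvcc was found there.
    """
    # Assumption: the active environment root is CONDA_PREFIX, else sys.prefix.
    prefix = prefix or os.environ.get("CONDA_PREFIX") or sys.prefix
    bin_dir = os.path.join(prefix, "bin")
    if not os.path.isfile(os.path.join(bin_dir, "nvcc")):
        return None

    os.environ["CUDA_HOME"] = prefix
    # Also expose nvcc on PATH for tools that shell out to it.
    os.environ["PATH"] = bin_dir + os.pathsep + os.environ.get("PATH", "")
    return prefix


# Intended to run before importing deepspeed, so the variable is already
# set when the library looks for the compiler.
cuda_home = point_cuda_home_at_env()
print("CUDA_HOME set to:", cuda_home)
```

If this prints None, the environment does not contain nvcc at the expected path, and the location should be checked by hand.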

Also, I was wondering: is there any way to SSH into the serving endpoint?