Hi everyone!
I'm encountering an issue while trying to serve my model on a GPU endpoint.
My model uses DeepSpeed, which needs to compile CUDA ops, and I got the following error:
"An error occurred while loading the model. CUDA_HOME does not exist, unable to compile CUDA op(s)."
Not having access to the endpoint through a terminal makes it hard to debug the issue.
On the personal compute that I used to register and test the model, CUDA is installed and the model works fine. CUDA is installed in /usr/local/cuda, as mentioned in the documentation.
But on the endpoint it seems that it is not the case.
I first tried setting the CUDA_HOME environment variable manually to '/usr/local/cuda', hoping it would work, but it didn't. I got the following error:
"[Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'"
Now I'm starting to wonder whether the endpoint computes have CUDA installed at all, which would be weird if not, right?
I ran these commands from my model loading method to check whether it could be installed elsewhere, but they found nothing:
import os
print(os.popen("ls -l /usr/local/").read())
print(os.popen("ls -l /opt/").read())
print(os.popen("nvcc --version").read())
print(os.popen("which nvcc").read())
[86bb6k8gpl] ls: cannot access '/usr/local/cuda': No such file or directory
[86bb6k8gpl] /bin/sh: 1: nvcc: not found
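Since checking one path at a time is slow without a terminal, a more systematic probe might look like the sketch below. It is only a sketch: the function name `find_cuda_toolkit` and the list of search roots are my own choices, not anything Databricks provides, and it assumes that environment variables like CUDA_HOME or CUDA_PATH, plus any `cuda*/bin/nvcc` under a few common directories, are the relevant places to look. The torch check is optional and only runs if torch can be imported.

```python
import glob
import os
import shutil


def find_cuda_toolkit(search_roots=("/usr/local", "/opt", "/usr/lib")):
    """Collect hints about where (or whether) a CUDA toolkit is installed.

    The search roots are guesses at common install locations, not a
    documented list for any particular platform.
    """
    report = {
        # Environment variables that build tooling commonly consults.
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
        # Full path to nvcc if it is on PATH, otherwise None.
        "nvcc_on_path": shutil.which("nvcc"),
        "nvcc_candidates": [],
        "torch_cuda_version": None,
        "torch_cuda_available": None,
    }

    for root in search_roots:
        # Matches e.g. <root>/cuda/bin/nvcc or <root>/cuda-12.1/bin/nvcc.
        pattern = os.path.join(root, "cuda*", "bin", "nvcc")
        report["nvcc_candidates"].extend(sorted(glob.glob(pattern)))

    try:
        import torch  # optional: only used if it is installed

        # CUDA version torch was built against (None for CPU-only builds).
        report["torch_cuda_version"] = torch.version.cuda
        # Whether the CUDA runtime and a GPU are usable right now.
        report["torch_cuda_available"] = torch.cuda.is_available()
    except Exception:
        # torch missing or failing to import: leave the fields as None.
        pass

    return report


if __name__ == "__main__":
    for key, value in find_cuda_toolkit().items():
        print(f"{key}: {value}")
```

Calling `print(find_cuda_toolkit())` from the model loading method would put everything in the endpoint logs in one pass. If `torch_cuda_available` is true but no nvcc shows up anywhere, that would suggest the image ships the CUDA runtime without the compiler toolkit.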
I'm pretty new to Databricks, so I may be missing something obvious. Maybe CUDA is installed in a custom location, but it's hard to find one print statement at a time.
Any help would be appreciated 😅