Serving GPU Endpoint, can't find CUDA
02-14-2024 04:58 AM
Hi everyone !
I'm encountering an issue while trying to serve my model on a GPU endpoint.
My model uses DeepSpeed, which needs CUDA, and I got the following error:
"An error occurred while loading the model. CUDA_HOME does not exist, unable to compile CUDA op(s)."
Not having access to the endpoint through a terminal makes it hard to debug the issue.
On the personal compute that I used to register and test the model, CUDA is installed and the model works fine. CUDA is installed in /usr/local/cuda, as mentioned in the documentation.
But on the endpoint it seems that it is not the case.
I first tried setting the CUDA_HOME environment variable manually to '/usr/local/cuda', hoping it would work, but it didn't. I got the following error:
"[Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'"
Now I'm starting to wonder whether the endpoint compute has CUDA installed at all, which would be weird if not, right?
I ran these commands from my model loading method to check whether it could be installed elsewhere, but they turned up nothing:
print(os.popen("ls -l /usr/local/").read())
print(os.popen("ls -l /opt/").read())
print(os.popen("nvcc --version").read())
print(os.popen("which nvcc").read())
[86bb6k8gpl] ls: cannot access '/usr/local/cuda': No such file or directory
[86bb6k8gpl] /bin/sh: 1: nvcc: not found
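A slightly more systematic version of the same check could look like the sketch below. It is only an illustration: find_cuda_home is a hypothetical helper name, and the candidate directories (including /opt/cuda and CONDA_PREFIX) are guesses rather than locations documented for serving endpoints. It looks for a directory that contains bin/nvcc, since that is the file the second error complained about.

```python
import os
import shutil
from pathlib import Path


def find_cuda_home(candidates=("/usr/local/cuda", "/opt/cuda")):
    """Return the first directory that contains bin/nvcc, or None.

    Hypothetical helper: the candidate paths are assumptions, not
    locations guaranteed to exist on a serving endpoint.
    """
    roots = []

    # 1. An explicitly configured CUDA_HOME, if any.
    env_home = os.environ.get("CUDA_HOME")
    if env_home:
        roots.append(Path(env_home))

    # 2. An nvcc that is already on PATH (root is two levels above bin/nvcc).
    nvcc_on_path = shutil.which("nvcc")
    if nvcc_on_path:
        roots.append(Path(nvcc_on_path).resolve().parent.parent)

    # 3. The active conda environment, in case the toolkit was installed there.
    conda_prefix = os.environ.get("CONDA_PREFIX")
    if conda_prefix:
        roots.append(Path(conda_prefix))

    # 4. Conventional install locations.
    roots.extend(Path(c) for c in candidates)

    for root in roots:
        if (root / "bin" / "nvcc").is_file():
            return str(root)
    return None


print("CUDA toolkit root:", find_cuda_home())
```

If this prints None, no nvcc was found in any of the places checked, which would be consistent with the output above.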
I'm pretty new to Databricks, so I may be missing something obvious. Maybe it is installed in a custom location, but that is hard to find print by print.
Any help would be appreciated.
02-20-2024 12:41 AM
Hi @Retired_mod ,
thanks for your reply !
I managed to install CUDA via conda.
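For anyone landing here with the same error, here is a rough sketch of how CUDA_HOME can be pointed at a conda-installed toolkit from the model loading code. It assumes the conda package places nvcc under the environment's bin directory, which may not hold for every package or channel. configure_cuda_home is a hypothetical helper name, not part of DeepSpeed or Databricks.

```python
import os
import sys
from pathlib import Path


def configure_cuda_home(prefix=sys.prefix):
    """Point CUDA_HOME at `prefix` if it contains bin/nvcc.

    Hypothetical helper. Assumes the conda-installed toolkit put nvcc
    in <prefix>/bin. Returns True if CUDA_HOME was set, False otherwise.
    """
    nvcc = Path(prefix) / "bin" / "nvcc"
    if not nvcc.is_file():
        return False

    os.environ["CUDA_HOME"] = str(prefix)
    # Also put nvcc on PATH so tools that shell out to it can find it.
    os.environ["PATH"] = f"{nvcc.parent}{os.pathsep}{os.environ.get('PATH', '')}"
    return True


if not configure_cuda_home():
    print("nvcc not found under", sys.prefix)
```

This probably needs to run before DeepSpeed (or anything else that compiles CUDA ops) is imported, since the toolkit location may be resolved at import time.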
Also I was wondering, is there any way to ssh to the serving endpoint?