Community Discussions

Serving GPU Endpoint, can't find CUDA

kfab
New Contributor II

Hi everyone !
I'm encountering an issue while trying to serve my model on a GPU endpoint.
My model uses DeepSpeed, which needs to compile CUDA ops, and I got the following error:

 

"An error occurred while loading the model. CUDA_HOME does not exist, unable to compile CUDA op(s)."

 

Not having access to the endpoint through a terminal makes it hard to debug the issue.
On the personal compute that I used to register and test the model, CUDA is installed and the model works fine. CUDA is installed in /usr/local/cuda, as mentioned in the documentation.

But on the endpoint, this doesn't seem to be the case.

I first tried to set the CUDA_HOME environment variable manually to '/usr/local/cuda', hoping it would work, but it didn't. I got the following error:

 

"[Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'"

 

Now I'm starting to wonder whether the endpoint's compute nodes have CUDA installed at all, which would be weird if they didn't, right?

I ran these commands from my model loading method to check whether CUDA might be installed elsewhere, but they returned nothing:

 

import os

print(os.popen("ls -l /usr/local/").read())
print(os.popen("ls -l /opt/").read())
print(os.popen("nvcc --version").read())
print(os.popen("which nvcc").read())

 

[86bb6k8gpl] ls: cannot access '/usr/local/cuda': No such file or directory
[86bb6k8gpl] /bin/sh: 1: nvcc: not found
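A broader sweep can be sketched like this (the path patterns are my guesses at common Linux toolkit locations, not anything Databricks-specific), which globs a few likely prefixes and also derives a toolkit root from any `nvcc` on the PATH:

```python
import glob
import os
import shutil

def find_cuda_candidates():
    """Scan common prefixes for CUDA toolkit installs and any nvcc on PATH."""
    candidates = []
    # Typical toolkit locations on Linux images (may vary per container)
    for pattern in ("/usr/local/cuda*", "/opt/cuda*", "/usr/lib/cuda*"):
        candidates.extend(sorted(glob.glob(pattern)))
    # An nvcc on PATH also reveals a toolkit root (the parent of bin/)
    nvcc = shutil.which("nvcc")
    if nvcc:
        candidates.append(os.path.dirname(os.path.dirname(nvcc)))
    return candidates

print(find_cuda_candidates())  # an empty list means none of these spots has a toolkit
```

An empty list from this sweep would confirm that no full CUDA toolkit is present in the serving container, rather than it merely living in an unusual place.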

I'm pretty new to Databricks, so I may be missing something obvious; maybe it's installed in a custom location that's hard to find print by print.
Any help would be appreciated 😅

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @kfab

It seems you’re encountering an issue related to CUDA while serving your model on a GPU endpoint.

Let’s troubleshoot this step by step.

  1. CUDA_HOME Not Found: The error message you received, “CUDA_HOME does not exist, unable to compile CUDA op(s),” indicates that the CUDA_HOME environment variable is not properly set. This variable points to the location where CUDA is installed.

  2. Checking CUDA Installation: You mentioned that on your personal compute, CUDA is installed in /usr/local/cuda. However, on the endpoint, it appears that this path is not recognized. Let’s verify if CUDA is indeed installed on the endpoint.

  3. Verify CUDA Installation: You can check whether CUDA is installed by running the following commands in your model loading method:

    print(os.popen("ls -l /usr/local/").read())
    print(os.popen("ls -l /opt/").read())
    print(os.popen("nvcc --version").read())
    print(os.popen("which nvcc").read())
    

    The output you provided indicates that the /usr/local/cuda directory does not exist, and the nvcc command is not found. This suggests that CUDA might not be installed on the endpoint.

  4. Possible Solutions: Here are some steps you can take to resolve this issue:

    • Install CUDA on the Endpoint: Make sure that CUDA is installed on the endpoint where you’re serving your model. If it’s not, you’ll need to install it. You can download the CUDA Toolkit from the NVIDIA website.

    • Set CUDA_HOME Environment Variable: Once CUDA is installed, set the CUDA_HOME environment variable to the correct path. For example, if CUDA is installed in /usr/local/cuda-X.X, run the following command:

      export CUDA_HOME=/usr/local/cuda-X.X
      

      Replace X.X with the appropriate version number (you can find this using nvcc --version).

    • Reinstall Dependencies: After setting CUDA_HOME, try loading your model again. If you encounter any issues related to missing CUDA libraries, consider reinstalling the relevant dependencies (such as deepspeed or other CUDA-dependent libraries) with the updated CUDA path.

    Remember to restart any relevant services or processes after making changes.
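The CUDA_HOME step above can be sketched in the model's loading code. This is a minimal sketch, assuming the toolkit root is known; `/usr/local/cuda-X.X` is the same placeholder as above, and since DeepSpeed reads CUDA_HOME when it compiles ops, this should run before it does so:

```python
import os

# Hypothetical toolkit root; replace X.X with the version nvcc --version reports
CUDA_ROOT = "/usr/local/cuda-X.X"

def configure_cuda_env(cuda_root):
    """Point CUDA_HOME at the toolkit and put its bin/ on PATH, failing fast if nvcc is missing."""
    nvcc = os.path.join(cuda_root, "bin", "nvcc")
    if not os.path.exists(nvcc):
        raise FileNotFoundError(f"no nvcc at {nvcc}; is the CUDA toolkit installed?")
    os.environ["CUDA_HOME"] = cuda_root
    os.environ["PATH"] = os.path.join(cuda_root, "bin") + os.pathsep + os.environ.get("PATH", "")

# configure_cuda_env(CUDA_ROOT)  # call before DeepSpeed compiles its CUDA ops
```

Failing fast on a missing `nvcc` turns the vague "CUDA_HOME does not exist" message into a concrete path you can act on.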

Feel free to follow up if you need further assistance or clarification! 🚀

 

kfab
New Contributor II

Hi @Kaniz_Fatma ,

thanks for your reply !

I managed to install CUDA via conda 👍
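For anyone landing here with the same fix: a conda-installed toolkit usually lives under the active environment's prefix, so CUDA_HOME can be derived from `sys.prefix`. A hedged sketch (the `bin/nvcc` layout is the typical one for conda toolkit packages, but package layouts vary by channel):

```python
import os
import sys

def cuda_home_from_conda():
    """Return the active env prefix if it contains bin/nvcc (conda-installed toolkit), else None."""
    nvcc = os.path.join(sys.prefix, "bin", "nvcc")
    return sys.prefix if os.path.exists(nvcc) else None

home = cuda_home_from_conda()
if home:
    os.environ["CUDA_HOME"] = home  # set before DeepSpeed compiles its ops
```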

Also I was wondering, is there any way to ssh to the serving endpoint?
