Administration & Architecture

Ray cannot detect GPU on the cluster

Awoke101
New Contributor III

I am trying to run Ray on Databricks for chunking and embedding tasks. The cluster I'm using is:

g4dn.xlarge
1-4 workers with 4-16 cores
1 GPU and 16GB memory

I have set spark.task.resource.gpu.amount to 0.5 currently.

This is how I have set up my Ray cluster:

setup_ray_cluster(
    min_worker_nodes=1,
    max_worker_nodes=3,
    num_gpus_head_node=1,
)
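As a sanity check (sketch, not from my actual notebook), something like this should list a GPU entry in the cluster resources if Ray registered one:

import ray

ray.init()  # connect to the Ray-on-Spark cluster created above
print(ray.cluster_resources())  # expect a "GPU" key if Ray sees the GPU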

And this is the chunking function:

@ray.remote(num_gpus=0.2)
def chunk_udf(row):
    # splitter is a text splitter defined earlier in the notebook
    texts = row["content"]
    data = row.copy()
    split_text = splitter.split_text(texts)
    split_text = [text.replace("\n", " ") for text in split_text]
    return list(zip(split_text, data))

When I run the flat_map function for chunking, it throws the following error:

chunked_ds = ds.flat_map(chunk_udf)
chunked_ds.show(5) 
At least one of the input arguments for this task could not be computed: ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU. 

Is there something I need to change in my setup?
torch.cuda.is_available() returns True in the notebook.
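
For debugging, a minimal probe task along the lines of this sketch (the gpu_probe name is just for illustration) should show what a Ray task actually sees, as opposed to the driver notebook:

import ray
import torch

@ray.remote(num_gpus=0.2)
def gpu_probe():
    # Runs on a Ray worker, so this reflects what chunk_udf would see,
    # not what the driver notebook sees.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "cuda_available": torch.cuda.is_available(),
    }

print(ray.get(gpu_probe.remote()))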

1 REPLY

Krishna_S
Databricks Employee

I have replicated all your steps and created the Ray cluster exactly as you did.

Also, I have set: spark.conf.set("spark.task.resource.gpu.amount", "0.5")

I then see a warning telling me not to reserve any GPU for Spark (it reports the effective value as 1.0), even though I set it to 0.5.

See the attached image and the warning below.

You configured 'spark.task.resource.gpu.amount' to 1.0, we recommend setting this value to 0 so that Spark jobs do not reserve GPU resources, preventing Ray-on-Spark workloads from having the maximum number of GPUs available.

What likely happened is that, since you set up the cluster to auto-scale, it did not scale as expected, so Spark reserved the only GPU on the node, which caused the issue you are seeing.
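
A configuration along these lines should free the GPU for Ray rather than Spark (a sketch only; whether num_gpus_worker_node is needed, and whether the Spark conf has to go in the cluster's Spark config rather than being set at runtime, depends on your Ray and Databricks Runtime versions):

# Stop Spark from reserving the GPU, as the warning recommends.
spark.conf.set("spark.task.resource.gpu.amount", "0")

from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# shutdown_ray_cluster()  # uncomment if an old Ray cluster is still running

setup_ray_cluster(
    min_worker_nodes=1,
    max_worker_nodes=3,
    num_gpus_head_node=1,
    num_gpus_worker_node=1,  # assumption: one GPU per g4dn.xlarge worker node
)

With Spark no longer holding the GPU, the num_gpus=0.2 request in chunk_udf can be scheduled on the GPU nodes.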
