Machine Learning

Forum Posts

Sorted by:

by sanjay • Valued Contributor II

02-09-2023 7:25:49 AM

38118 Views
2 replies
1 kudos

Resolved! torch.cuda.OutOfMemoryError: CUDA out of memory

Hi,I am using pynote/whisper large model and trying to process data using spark UDF and getting following error.torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 14.76 GiB total capacity; 6.07 GiB already allocated...

Machine Learning

38118 Views
2 replies
1 kudos

02-09-2023 7:25:49 AM

View Replies

Latest Reply

JMTech18
New Contributor II

09-23-2024 4:34:40 AM

1 kudos

Try to run these codesimport torchtorch.cuda.empty_cache()And make sure to find the optimize batch size otherwise the error can occur again

1 kudos

09-23-2024 4:34:40 AM

1 More Replies

by ppang • New Contributor III

05-20-2023 3:23:29 PM

1527 Views
1 replies
0 kudos

Does Databricks Container Services (DCS) support for GPU containers with Databricks Runtime 11.3 LTS and higher?

I have been trying to start a cluster using DCS with GPU containers (https://github.com/databricks/containers/tree/master/ubuntu/gpu), but was only successful with Databricks Runtime 10.4 LTS and lower. With Databricks Runtime 11.3 LTS and higher, I ...

Machine Learning

1527 Views
1 replies
0 kudos

05-20-2023 3:23:29 PM

View Replies

Latest Reply

jessysantos
Databricks Employee

05-27-2024 3:33:26 PM

0 kudos

Hello @ppang ! Since you posted your question, the repository you shared has received an update, which includes the following warning: "Using conda in DCS images is no longer supported starting Databricks Runtime 9.0. We highly recommend users to ext...

0 kudos

05-27-2024 3:33:26 PM

by DataBRObin • New Contributor III

05-30-2023 8:13:55 AM

2213 Views
2 replies
0 kudos

Running Keras model training with HorovodRunner works until the training function is exited ("The MPI_Query_thread() function was called after MPI_FINALIZE was invoked.")

I am running training of a Keras/Tensorflow deep learning model on a cluster of (for now) 2 workers and 1 driver (T4 GPU, 28GB, 4 core) using the Databricks provided HorovodRunner. It all seems to go well and the performance scales quite nicely over ...

Machine Learning

2213 Views
2 replies
0 kudos

05-30-2023 8:13:55 AM

View Replies

Latest Reply

sean_owen
Databricks Employee

06-02-2023 5:57:31 PM

0 kudos

I personally suspect it's your callbacks. Can you remove all those state callbacks and see if that is it?

0 kudos

06-02-2023 5:57:31 PM

1 More Replies

by alisher_pwc • New Contributor II

03-03-2023 6:13:31 AM

3626 Views
2 replies
1 kudos

Model serving with GPU cluster

Hello Databricks community!We are facing a strong need of serving some of public and our private models on GPU clusters and we have several requirements:1) We'd like to be able to start/stop the endpoints (best with scheduling) to avoid excess consum...

Machine Learning

3626 Views
2 replies
1 kudos

03-03-2023 6:13:31 AM

View Replies

Latest Reply

Vartika
Databricks Employee

03-31-2023 12:35:12 AM

1 kudos

Hi @Alisher Akh Does @Debayan Mukherjee's answer help? If yes, would you be happy to mark the answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you further. Cheers!

1 kudos

03-31-2023 12:35:12 AM

1 More Replies

by zzy • New Contributor III

10-14-2022 10:07:02 AM

2093 Views
2 replies
2 kudos

Why is GPU accelerated node much slower than CPU node for training a random forest model on databricks?

I have a dataset about 5 million rows with 14 features and a binary target. I decided to train a pyspark random forest classifier on Databricks. The CPU cluster I created contains 2 c4.8xlarge workers (60GB, 36core) and 1 r4.xlarge (31GB, 4core) driv...

Machine Learning

2093 Views
2 replies
2 kudos

10-14-2022 10:07:02 AM

View Replies

Latest Reply

Hubert-Dudek
Esteemed Contributor III

10-20-2022 5:40:36 AM

2 kudos

In many cases, you need to adjust your code to utilize GPU.

2 kudos

10-20-2022 5:40:36 AM

1 More Replies

Databricks Community

Resolved! torch.cuda.OutOfMemoryError: CUDA out of memory

Does Databricks Container Services (DCS) support for GPU containers with Databricks Runtime 11.3 LTS and higher?

Running Keras model training with HorovodRunner works until the training function is exited ("The MPI_Query_thread() function was called after MPI_FINALIZE was invoked.")

Model serving with GPU cluster

Why is GPU accelerated node much slower than CPU node for training a random forest model on databricks?