Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Forum Posts

ppang
New Contributor III
  • 972 Views
  • 1 reply
  • 0 kudos

Does Databricks Container Services (DCS) support GPU containers with Databricks Runtime 11.3 LTS and higher?

I have been trying to start a cluster using DCS with GPU containers (https://github.com/databricks/containers/tree/master/ubuntu/gpu), but was only successful with Databricks Runtime 10.4 LTS and lower. With Databricks Runtime 11.3 LTS and higher, I ...

Latest Reply
jessysantos
New Contributor III
  • 0 kudos

Hello @ppang ! Since you posted your question, the repository you shared has received an update, which includes the following warning: "Using conda in DCS images is no longer supported starting Databricks Runtime 9.0. We highly recommend users to ext...

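For anyone hitting the same wall, below is a minimal sketch (not an official recipe) of creating a DCS cluster with a custom GPU image through the Clusters REST API. The runtime string, node type, and image URL are assumptions to replace with values your workspace actually lists.

    # Hypothetical sketch: create a cluster that uses a custom GPU container via
    # Databricks Container Services (Clusters API 2.0). The spark_version,
    # node_type_id, and image URL below are assumptions -- substitute values
    # that your workspace actually supports.
    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder

    cluster_spec = {
        "cluster_name": "dcs-gpu-test",
        "spark_version": "11.3.x-gpu-ml-scala2.12",  # assumed GPU runtime string; pick one your workspace lists
        "node_type_id": "g4dn.xlarge",               # assumed GPU instance type (AWS example)
        "num_workers": 1,
        "docker_image": {
            # Image built from databricks/containers ubuntu/gpu, pushed to your registry
            "url": "<your-registry>/databricks-gpu-custom:latest",
        },
    }

    resp = requests.post(
        f"{HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    resp.raise_for_status()
    print(resp.json())  # returns the new cluster_id on success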
DataBRObin
New Contributor III
  • 1353 Views
  • 2 replies
  • 0 kudos

Running Keras model training with HorovodRunner works until the training function is exited ("The MPI_Query_thread() function was called after MPI_FINALIZE was invoked.")

I am running training of a Keras/Tensorflow deep learning model on a cluster of (for now) 2 workers and 1 driver (T4 GPU, 28GB, 4 core) using the Databricks provided HorovodRunner. It all seems to go well and the performance scales quite nicely over ...

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

I personally suspect it's your callbacks. Can you remove all those state callbacks and see if that is it?

1 More Replies
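For context, a stripped-down version of the HorovodRunner pattern being suggested above, keeping only Horovod's own broadcast callback and dropping any custom state or checkpoint callbacks, might look like the sketch below. The tiny synthetic model and data are placeholders for the real Keras training code.

    # Minimal HorovodRunner sketch: only Horovod's own callback, no custom
    # state/checkpoint callbacks, to check whether those trigger the
    # MPI_FINALIZE error. Model and data here are synthetic placeholders.
    def train():
        import numpy as np
        import tensorflow as tf
        import horovod.tensorflow.keras as hvd

        hvd.init()
        gpus = tf.config.list_physical_devices("GPU")
        if gpus:
            # Pin each Horovod process to one local GPU
            tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

        # Tiny synthetic dataset/model just to exercise the training loop
        x = np.random.rand(1024, 14).astype("float32")
        y = (np.random.rand(1024) > 0.5).astype("float32")

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu", input_shape=(14,)),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
        model.compile(optimizer=opt, loss="binary_crossentropy")

        # Only Horovod's own callback; no custom state callbacks
        callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
        model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
                  verbose=2 if hvd.rank() == 0 else 0)

    from sparkdl import HorovodRunner
    hr = HorovodRunner(np=2)  # 2 worker slots, matching the cluster in the question
    hr.run(train)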
alisher_pwc
New Contributor II
  • 2054 Views
  • 2 replies
  • 1 kudos

Model serving with GPU cluster

Hello Databricks community! We are facing a strong need to serve some public models and some of our private models on GPU clusters, and we have several requirements: 1) We'd like to be able to start/stop the endpoints (ideally with scheduling) to avoid excess consum...

Latest Reply
Vartika
Moderator
  • 1 kudos

Hi @Alisher Akh, does @Debayan Mukherjee's answer help? If yes, would you be happy to mark the answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you further. Cheers!

1 More Replies
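The thread is truncated, but for the start/stop-with-scheduling requirement one possible approach (a sketch under assumptions, not an official recipe) is to rewrite the endpoint configuration from a scheduled job via the Serving Endpoints REST API. The endpoint and model names, the workload_type, and whether scale-to-zero is available for your GPU workload size are all things to verify against your workspace.

    # Hypothetical sketch: toggle a model serving endpoint from a scheduled job
    # by rewriting its config through the Serving Endpoints REST API.
    # Names, workload_type, and scale_to_zero availability are assumptions.
    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder
    ENDPOINT = "my-gpu-endpoint"                             # placeholder

    def set_endpoint_config(scale_to_zero: bool) -> None:
        body = {
            "served_models": [{
                "model_name": "my_registered_model",  # placeholder
                "model_version": "1",                 # placeholder
                "workload_type": "GPU_SMALL",         # assumed GPU workload type
                "workload_size": "Small",
                "scale_to_zero_enabled": scale_to_zero,
            }]
        }
        resp = requests.put(
            f"{HOST}/api/2.0/serving-endpoints/{ENDPOINT}/config",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json=body,
        )
        resp.raise_for_status()

    # Run from a scheduled Databricks job: one schedule enables scale-to-zero
    # for off-hours, another restores the regular config before business hours.
    set_endpoint_config(scale_to_zero=True)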
sanjay
Valued Contributor II
  • 28363 Views
  • 1 reply
  • 1 kudos

Resolved! torch.cuda.OutOfMemoryError: CUDA out of memory

Hi, I am using the pynote/whisper large model and trying to process data with a Spark UDF, and I am getting the following error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 14.76 GiB total capacity; 6.07 GiB already allocated...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Sanjay Jain: The error message suggests that there is not enough available memory on the GPU to allocate for the PyTorch model. This error can occur if the model is too large to fit into the available memory on the GPU, or if the GPU memory is bei...

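The reply is truncated, but the usual mitigations in a Spark UDF setting are to load the model once per worker process, keep per-call batches small, run in half precision, and release cached CUDA memory between batches. Below is a sketch assuming the OpenAI whisper package and a pandas UDF over a column of audio file paths; adjust it to the actual model in use.

    # Hypothetical sketch of a memory-leaner transcription UDF: load the model
    # once per executor Python process, use fp16, and release cached CUDA
    # memory between batches. The `whisper` package and column contents are
    # assumptions based on the question.
    import pandas as pd
    import torch
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    _model = None  # one model per Python worker process, not per row

    def _get_model():
        global _model
        if _model is None:
            import whisper
            _model = whisper.load_model("large", device="cuda")
        return _model

    @pandas_udf(StringType())
    def transcribe_udf(paths: pd.Series) -> pd.Series:
        # `paths` must be file paths visible on the workers (e.g. /dbfs/...)
        model = _get_model()
        texts = []
        for p in paths:
            result = model.transcribe(p, fp16=True)  # fp16 halves the GPU footprint
            texts.append(result["text"])
        torch.cuda.empty_cache()  # return cached blocks so later tasks can allocate
        return pd.Series(texts)

It can also help to limit how many UDF tasks share one GPU at a time (e.g. via spark.task.resource.gpu.amount) so several tasks do not compete for the same 16 GB card.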
zzy
New Contributor III
  • 1314 Views
  • 2 replies
  • 2 kudos

Why is GPU accelerated node much slower than CPU node for training a random forest model on databricks?

I have a dataset about 5 million rows with 14 features and a binary target. I decided to train a pyspark random forest classifier on Databricks. The CPU cluster I created contains 2 c4.8xlarge workers (60GB, 36core) and 1 r4.xlarge (31GB, 4core) driv...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

In many cases, you need to adjust your code to utilize the GPU.

1 More Replies
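To make "adjust your code to utilize the GPU" concrete: Spark MLlib's RandomForestClassifier runs on the CPU regardless of the node type, so a GPU node only adds cost there. One GPU-aware alternative is sketched below using XGBoost's GPU histogram method on data collected to the driver; it assumes the roughly 5M x 14 dataset fits in driver memory and a label column named "target", so treat it as an illustration rather than a drop-in replacement for the distributed MLlib pipeline.

    # Illustrative sketch: a tree-ensemble model that actually uses the GPU,
    # via XGBoost's GPU histogram algorithm. Trains on a single (driver) node,
    # so it assumes the data fits in driver memory after .toPandas().
    import xgboost as xgb

    # `df` is assumed to be the Spark DataFrame with 14 feature columns
    # plus a binary label column named "target" (name is an assumption).
    pdf = df.toPandas()
    X, y = pdf.drop(columns=["target"]), pdf["target"]

    clf = xgb.XGBClassifier(
        n_estimators=200,
        max_depth=8,
        tree_method="gpu_hist",  # GPU histogram trees (on XGBoost >= 2.0 use device="cuda", tree_method="hist")
        eval_metric="logloss",
    )
    clf.fit(X, y)
    print(clf.score(X, y))

If training genuinely needs to stay distributed, recent XGBoost releases also ship a Spark estimator (xgboost.spark) that can use GPUs on the workers.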