Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by Alex_Persin (New Contributor II)
  • 4421 Views
  • 3 replies
  • 2 kudos

How can the shared memory size (/dev/shm) be increased on Databricks worker nodes with custom Docker images?

PyTorch uses shared memory to efficiently share tensors between its DataLoader workers and its main process. However, in a Docker container the default size of the shared memory (a tmpfs file system mounted at /dev/shm) is 64MB, which is too small to ...

Latest Reply
Hugh_Ku
New Contributor II
  • 2 kudos

Hey folks, any follow-up on this, or an alternative solution? Thanks

2 More Replies
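For readers landing here: a minimal sketch of one commonly cited PyTorch-side workaround, assuming the goal is to stop DataLoader workers from exhausting the container's 64 MB /dev/shm. Switching the tensor-sharing strategy to the file system avoids POSIX shared memory entirely; remounting /dev/shm with a larger size via a cluster init script is the other route people mention, and is not shown here. The dataset shape and worker count below are illustrative placeholders.

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# "file_system" passes tensors between the workers and the main process via
# files (e.g. under /tmp) instead of POSIX shared memory, so the tiny tmpfs
# at /dev/shm inside the container is no longer the bottleneck. The trade-off
# is extra file-descriptor and disk traffic compared with the default strategy.
mp.set_sharing_strategy("file_system")

dataset = TensorDataset(torch.randn(1000, 3, 224, 224))  # toy data only
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for (batch,) in loader:
    pass  # real training step would go here
```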
by hvsk (New Contributor)
  • 10894 Views
  • 2 replies
  • 0 kudos

Using a virtual environment

Hi all, we are working on training NHiTS/TFT (a pytorch-forecasting implementation) for time-series forecasting. However, we are having some issues with package dependency conflicts. Is there a way to consistently use a virtual environment across cells ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Harsh Kalra, hope everything is going great. Just wanted to check in to see if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so w...

1 More Replies
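For reference, a minimal sketch of the usual Databricks answer to this kind of dependency conflict, assuming notebook-scoped libraries fit the use case: a `%pip` cell installs into an environment isolated to the notebook, and every later cell in that notebook sees the same packages. The version pins below are illustrative, not verified against pytorch-forecasting's real requirements.

```python
# Cell 1 — %pip is a Databricks notebook magic. It installs into an
# environment scoped to this notebook only, so pinned versions here
# (illustrative, unverified) cannot clash with other notebooks or jobs
# sharing the cluster.
%pip install pytorch-forecasting==1.0.0 pytorch-lightning==2.0.9

# Cell 2 — the same notebook-scoped environment is visible in every
# subsequent cell, giving consistent package versions across cells.
import pytorch_forecasting
print(pytorch_forecasting.__version__)
```

If an install needs a fresh interpreter to take effect, `dbutils.library.restartPython()` restarts the notebook's Python process while keeping the notebook-scoped environment.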
by Smu_Tan (New Contributor)
  • 2158 Views
  • 3 replies
  • 1 kudos

Resolved! Does Databricks support PyTorch distributed training across multiple devices?

Hi, I'm trying to use the Databricks platform to do PyTorch distributed training, but I didn't find any info about this. What I expected is using multiple clusters to run a common job using PyTorch distributed data parallel (DDP) with the code belo...

Latest Reply
axb0
Esteemed Contributor III
  • 1 kudos

With the Databricks ML Runtime (MLR), HorovodRunner is provided, which supports distributed training and inference with PyTorch. Here's an example notebook for your reference: PyTorchDistributedDeepLearningTraining - Databricks.

2 More Replies
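For reference, a minimal sketch of the HorovodRunner pattern the reply describes, assuming a Databricks ML Runtime cluster where `sparkdl.HorovodRunner` and `horovod.torch` are available. The toy model, random data, and `np=2` worker count are placeholders, not a tuned recipe.

```python
def train_one_worker():
    # Imports live inside the function because HorovodRunner pickles it
    # and launches one copy per worker process.
    import torch
    import torch.nn as nn
    import horovod.torch as hvd

    hvd.init()  # one Horovod process per task

    model = nn.Linear(10, 1)  # toy model, placeholder for a real network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Average gradients across workers and start everyone from rank 0's
    # initial weights so the replicas stay in sync.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    loss_fn = nn.MSELoss()
    for _ in range(100):
        x = torch.randn(32, 10)  # stand-in for a properly sharded dataset
        y = torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# HorovodRunner ships with the Databricks ML Runtime; np=2 requests two
# worker processes on the cluster.
from sparkdl import HorovodRunner
hr = HorovodRunner(np=2)
hr.run(train_one_worker)
```

Note that HorovodRunner distributes work across the nodes of a single cluster; it does not span multiple clusters.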