Trying out Dolly - how to load pytorch_model.bin so it's not downloading it every time the cluster is restarted

New Contributor II

Hi, I am new to LLMs and curious to try one out. I ran the following code from the Databricks website to test it:

import torch
from transformers import pipeline
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

and it seems to be downloading a 24 GB model file every time the cluster is restarted.

Downloading (…)"pytorch_model.bin";: 100% - 23.8G/23.8G [02:39<00:00, 128MB/s]

Is there a way (and where can I find the instructions) to load the pytorch_model.bin file "locally" so that it is not downloaded every time the cluster is restarted?

Add-on question: what's a decent cluster config to test things out? So far I've been testing with g4dn.2xlarge (32 GB, 1 GPU) on 12.2 LTS ML (GPU), and it gives me a CUDA out of memory error.

OutOfMemoryError: CUDA out of memory. Tried to allocate 492.00 MiB (GPU 0; 14.76 GiB total capacity; 13.52 GiB already allocated; 483.75 MiB free; 13.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


Not applicable

@H T: I won't have a specific answer for Dolly right now, but I can give you a framework to think about it, test, and try.

To avoid downloading the model every time the cluster is restarted, you can upload the model files to your Databricks workspace or to a cloud storage account and then load them from there instead of using the default model location. Upload the whole model directory, not only pytorch_model.bin, because the loader also needs config.json and the tokenizer files that sit alongside the weights. You can then specify the model argument as the path to that directory:

instruct_pipeline = pipeline(model="/path/to/local/model", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
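One way to get the model files onto persistent storage once and reuse them after a restart is sketched below. It is an illustration under assumptions, not a tested Databricks recipe: the helper names (`is_saved_model`, `ensure_local_model`, `load_instruct_pipeline`) and the DBFS path in the usage comment are hypothetical, and it assumes a `huggingface_hub` version whose `snapshot_download` accepts `local_dir`.

```python
import os


def is_saved_model(model_dir):
    """Return True if model_dir looks like a saved Hugging Face model directory."""
    # config.json sits next to the weights in a model repo. This check does not
    # verify that the weight files themselves finished downloading.
    return os.path.isfile(os.path.join(model_dir, "config.json"))


def ensure_local_model(model_id, model_dir):
    """Download the model repo into model_dir once; later calls reuse it."""
    if not is_saved_model(model_dir):
        # Imported lazily so the path helpers work without the library installed.
        from huggingface_hub import snapshot_download
        snapshot_download(repo_id=model_id, local_dir=model_dir)
    return model_dir


def load_instruct_pipeline(model_dir):
    """Build the pipeline from the local directory instead of the Hub."""
    import torch
    from transformers import pipeline
    return pipeline(model=model_dir, torch_dtype=torch.bfloat16,
                    trust_remote_code=True, device_map="auto")


# Usage in a notebook (the path is a hypothetical DBFS location):
# model_dir = ensure_local_model("databricks/dolly-v2-12b",
#                                "/dbfs/tmp/models/dolly-v2-12b")
# instruct_pipeline = load_instruct_pipeline(model_dir)
```

The first call after a fresh setup still downloads the full model; later cluster restarts should only read from the persistent path, provided that path survives the restart.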

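As for the cluster configuration, the binding constraint here is GPU memory rather than the size of your data. In bfloat16 each parameter takes 2 bytes, so the 12B-parameter model needs roughly 24 GB for its weights alone (which matches the 23.8G download), while the GPU in your error reports 14.76 GiB of total capacity. You would need a GPU with more memory, several GPUs that device_map="auto" can spread the model across, or a smaller model such as databricks/dolly-v2-3b. You can also try the max_split_size_mb setting mentioned in the error message, but note that it is a PyTorch allocator option set through the PYTORCH_CUDA_ALLOC_CONF environment variable, not a pipeline argument. It limits fragmentation by preventing the allocator from splitting blocks larger than the given size (in MB); it does not reduce how much memory the model needs. In your error the reserved and allocated amounts are almost identical (13.53 GiB vs 13.52 GiB), so fragmentation is unlikely to be the cause.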
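A back-of-the-envelope check of whether a model's weights fit on a GPU can be sketched as follows. The parameter counts are the nominal ones from the model names, the 14.76 GiB figure is taken from the error message above, and the estimate covers weights only (activations and generation buffers need additional memory), so treat it as a lower bound. The function names are made up for this example.

```python
GIB = 1024 ** 3


def weights_gib(n_params, bytes_per_param=2):
    """Approximate memory for the weights alone, in GiB (bfloat16 = 2 bytes)."""
    return n_params * bytes_per_param / GIB


def fits(n_params, gpu_gib, bytes_per_param=2):
    """True if the weights alone are smaller than the GPU's capacity."""
    return weights_gib(n_params, bytes_per_param) < gpu_gib


T4_GIB = 14.76  # total capacity reported in the CUDA error above

# Nominal parameter counts taken from the model names (approximate).
for name, n_params in [("dolly-v2-12b", 12e9), ("dolly-v2-3b", 3e9)]:
    print(f"{name}: ~{weights_gib(n_params):.1f} GiB of weights, "
          f"fits on this GPU: {fits(n_params, T4_GIB)}")
```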

New Contributor II

@Suteja Kanuri​ -


Thanks for responding. I tried your suggestion but got an error:


ValueError: The following `model_kwargs` are not used by the model: ['max_split_size_mb'] (note: typos in the generate arguments will also show up in this list)


Specifically, I am testing the demo Databricks provided (llm-dolly-chatbot), and I get this error in 03-Q&A-prompt-engineering-for-dolly, in the build_qa_chain() function, when pipeline is called.


Additional info:

  • I am running this on AWS g4dn.xlarge with the T4 GPU (this is what the dbdemo script selected). I also have g5 and p3 available. Would I have better luck there?
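The ValueError above is what transformers reports when an unknown keyword reaches the model, which suggests max_split_size_mb was passed as a pipeline or generate argument. A minimal sketch of setting it where PyTorch reads it, assuming this runs before the first CUDA allocation in the Python process (the value 128 and the helper name are arbitrary choices for illustration):

```python
import os


def set_max_split_size_mb(size_mb):
    """Set the PyTorch CUDA allocator option via its environment variable."""
    conf = f"max_split_size_mb:{size_mb}"
    # Must be set before PyTorch makes its first CUDA allocation to take effect.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
    return conf


# 128 is an arbitrary example value, not a recommendation.
set_max_split_size_mb(128)
```

This only changes how the allocator splits cached blocks; it cannot make a model fit whose weights exceed the GPU's memory.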

New Contributor II

@Suteja Kanuri Update - I was able to get it to work by upgrading to a g4dn.12xlarge node (4 GPUs).

However, the code in 02-Data-preparation that applies the sshleifer/distilbart-cnn-12-6 model for a summarization task failed on the more powerful node (it worked fine with a single GPU). Do you have any suggestions there?

I set repartition to 4 since there were 4 GPUs. docs_limit_df has 4 rows.

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device_map="auto")
docs_limit_df = docs_limit_df.repartition(4).withColumn("text_short", summarize_all("text"))

The error I got was

"PythonException: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!', from <command-434574176370212>, line 8. Full traceback below:"
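A likely cause of this error is that device_map="auto" spreads the model's layers across all four GPUs, which a model this small does not need. One thing to try, sketched below under that assumption, is pinning the summarizer to a single device with the pipeline's `device` argument instead. The helper names are made up for this example, and the demo's own summarize_all function is not shown in this thread, so how it would call these helpers is left open.

```python
def pick_device(cuda_available):
    """Pipeline `device` argument: GPU index 0 if CUDA is available, else -1 (CPU)."""
    return 0 if cuda_available else -1


def build_summarizer():
    """Load the summarization model on one device rather than device_map="auto"."""
    # Imported lazily so the helpers can be defined without the libraries installed.
    import torch
    from transformers import pipeline
    return pipeline("summarization", model="sshleifer/distilbart-cnn-12-6",
                    device=pick_device(torch.cuda.is_available()))


def summarize_texts(summarizer, texts):
    """Summarize a batch of strings and return the summaries as a list."""
    outputs = summarizer(list(texts), truncation=True)
    return [out["summary_text"] for out in outputs]
```

With this version every task uses GPU 0, so the other three GPUs stay idle for the summarization step; that trades throughput for avoiding the cross-device error.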

Not applicable

Hi @H T​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

