Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Trying out Dolly - how to load pytorch_model.bin so it's not downloading it every time the cluster is restarted

HT
New Contributor II

Hi, I am new to LLMs and curious to try one out. I ran the following code from the Databricks website to test it:

import torch
from transformers import pipeline
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

and it seems to download a 24 GB model file every time the cluster is restarted.

Downloading (…)"pytorch_model.bin";: 100% - 23.8G/23.8G [02:39<00:00, 128MB/s]

Is there a way (and where can I find the instructions) to load the pytorch_model.bin file "locally" so that it is not downloaded every time the cluster is restarted?

Add-on question: what's a decent cluster config to test things out? So far I've been testing with g4dn.2xlarge (32 GB, 1 GPU) on 12.2 LTS ML (GPU), and I get a CUDA out of memory error.

OutOfMemoryError: CUDA out of memory. Tried to allocate 492.00 MiB (GPU 0; 14.76 GiB total capacity; 13.52 GiB already allocated; 483.75 MiB free; 13.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

5 REPLIES

Anonymous
Not applicable

@H T: I don't have a specific answer for Dolly right now, but I can give you a framework to think about, test, and try.

To avoid downloading the model every time the cluster is restarted, you can put the model files on persistent storage, such as DBFS or a cloud storage account, and load them from there instead of from the default model location. You do this by passing the directory that contains the model files (the config, the tokenizer files, and pytorch_model.bin) as the model argument, rather than the path to the .bin file itself:

instruct_pipeline = pipeline(model="/path/to/local/model", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
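A minimal sketch of that "download once, then load from a directory" pattern is below. It assumes the `huggingface_hub` package is available (it ships as a dependency of `transformers`). The helper names `resolve_model_source` and `load_pipeline` are made up for illustration, as is the `local_dir` argument, which should point at whatever persistent location you choose. The heavy imports are done lazily inside `load_pipeline`, so nothing is downloaded until you call it.

```python
from pathlib import Path


def resolve_model_source(local_dir: str, hub_id: str) -> str:
    """Return local_dir if it already holds a model (has config.json), else hub_id."""
    if (Path(local_dir) / "config.json").is_file():
        return local_dir
    return hub_id


def load_pipeline(local_dir: str, hub_id: str = "databricks/dolly-v2-12b"):
    """Hypothetical helper: copy the Hub repo to local_dir once, then load from it."""
    # Lazy imports so resolve_model_source() works without torch/transformers.
    import torch
    from huggingface_hub import snapshot_download
    from transformers import pipeline

    if resolve_model_source(local_dir, hub_id) == hub_id:
        # First run only: fetch every file in the repo (weights, config,
        # tokenizer, custom pipeline code) into the persistent directory.
        snapshot_download(repo_id=hub_id, local_dir=local_dir)

    # Pass the directory, not the path to pytorch_model.bin.
    return pipeline(
        model=local_dir,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
```

On later cluster restarts `resolve_model_source` finds `config.json` in the directory and the download step is skipped. Whether reading 24 GB from your storage is faster than re-downloading depends on the storage, so it is worth timing both.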

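As for the cluster configuration, it depends mainly on the size of the model you want to load. For testing purposes, you can start with a smaller instance and scale up as needed, but note that the weights for this model are about 24 GB (the size of the download above), while the error reports under 15 GiB of total capacity on your GPU. You can also try the max_split_size_mb setting that the error message mentions. It is an option of PyTorch's CUDA caching allocator, set through the PYTORCH_CUDA_ALLOC_CONF environment variable, and it limits how large a cached block the allocator is allowed to split. That can reduce fragmentation, but it does not make the model itself any smaller.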
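A back-of-the-envelope check makes the out-of-memory error easier to read. The sketch below assumes a nominal 12 billion parameters (taken from the model name, not measured) stored as bfloat16 at 2 bytes each, and compares that with the 14.76 GiB total capacity quoted in the error. It also shows the environment-variable form of `max_split_size_mb`; the value 128 is only a placeholder, not a recommendation.

```python
import os


def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB (no activations or cache)."""
    return n_params * bytes_per_param / 2**30


# Assumption: ~12e9 parameters, bfloat16 (2 bytes per parameter).
needed_gib = weights_gib(12e9, 2)
gpu_total_gib = 14.76  # "total capacity" from the error message in the question
fits_on_one_gpu = needed_gib < gpu_total_gib

print(f"weights: {needed_gib:.1f} GiB, GPU: {gpu_total_gib} GiB, fits: {fits_on_one_gpu}")

# Allocator option from the error message. It must be set before CUDA is
# initialised in the process; 128 is an arbitrary illustrative value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

Because the weights alone exceed the memory of a single GPU of this size, allocator tuning cannot fix the error by itself. The options are a GPU with more memory, several GPUs with `device_map="auto"`, or a smaller model.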

HT
New Contributor II

@Suteja Kanuri​ -

Hi,

Thanks for responding. I tried your suggestion but got an error:

ValueError: The following `model_kwargs` are not used by the model: ['max_split_size_mb'] (note: typos in the generate arguments will also show up in this list)

Specifically, I am testing the demo Databricks provides (https://www.dbdemos.ai/, llm-dolly-chatbot), and I get this error in 03-Q&A-prompt-engineering-for-dolly, in the build_qa_chain() function, when pipeline is called.

Thoughts?

Additional info:

  • I am running this on AWS g4dn.xlarge with a T4 GPU (this is what the dbdemos script selected). I also have g5 and p3 available; would I have better luck there?

HT
New Contributor II

@Suteja Kanuri​ Update - I was able to get it to work by upgrading to a g4dn.12xlarge node (4 gpus).

However, the code in 02-Data-preparation that applies the sshleifer/distilbart-cnn-12-6 model to a summarization task failed on the more powerful node (while it worked fine with just a single GPU). Do you have any suggestions there?

I set repartition to 4 since there were 4 GPUs. docs_limit_df has 4 rows.

torch.cuda.empty_cache()
 
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device_map="auto")
 
docs_limit_df = docs_limit_df.repartition(4).withColumn("text_short", summarize_all("text"))

The error I got was

"PythonException: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!', from <command-434574176370212>, line 8. Full traceback below:"

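One plausible cause, offered as a guess rather than a confirmed diagnosis: with `device_map="auto"` on a 4-GPU node the small summarization model may get split across GPUs, and some layers then see inputs on a different device. A sketch of one thing to try is to pin the whole model to a single GPU with the `device` argument instead. The helper names below are hypothetical.

```python
def summarizer_kwargs(single_gpu: bool = True) -> dict:
    """Hypothetical helper: choose device placement arguments for pipeline()."""
    if single_gpu:
        # Put the whole model on GPU 0; fine for a model that fits on one card.
        return {"device": 0}
    # Let accelerate spread the model over all visible GPUs.
    return {"device_map": "auto"}


def build_summarizer():
    # Lazy import so the helper above can be used without transformers installed.
    from transformers import pipeline

    return pipeline(
        "summarization",
        model="sshleifer/distilbart-cnn-12-6",
        **summarizer_kwargs(single_gpu=True),
    )
```

This keeps every tensor of the summarizer on `cuda:0`, at the cost of leaving the other three GPUs idle for this step. The large Dolly model can still use `device_map="auto"` in the later notebook.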
Anonymous
Not applicable

Hi @H T​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!

sean_owen
Honored Contributor II

Just set the HF cache dir to a persistent path on /dbfs:

import os
os.environ['TRANSFORMERS_CACHE'] = "/dbfs/..."
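A slightly fuller sketch of the same idea follows. The helper name `configure_hf_cache` is made up, and the demo uses a temporary directory purely so the snippet runs anywhere; on a cluster you would pass your own persistent /dbfs path instead. The variable should be set before `transformers` is first imported, because the library reads it at import time.

```python
import os
import tempfile


def configure_hf_cache(cache_dir: str) -> str:
    """Hypothetical helper: create cache_dir and point the HF cache at it."""
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TRANSFORMERS_CACHE"] = cache_dir
    return cache_dir


# Demo only: a throwaway directory stands in for a persistent path.
demo_cache = configure_hf_cache(os.path.join(tempfile.mkdtemp(), "hf_cache"))
print("HF cache directory:", demo_cache)

# Import transformers only after this point, e.g.:
# from transformers import pipeline
```

As far as I know, more recent `transformers` releases prefer the `HF_HOME` variable and warn that `TRANSFORMERS_CACHE` is deprecated, so check which one your installed version expects.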
