04-13-2023 10:48 AM
Hi, I am new to LLMs and curious to try one out. I ran the following code from the Databricks website:
import torch
from transformers import pipeline
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
and it seems to be downloading a 24 GB model file every time the cluster is restarted:
Downloading (…)pytorch_model.bin: 100% - 23.8G/23.8G [02:39<00:00, 128MB/s]
Is there a way (and where can I find the instructions) to store the pytorch_model.bin file "locally" so it isn't downloaded every time the cluster is restarted?
Add-on question: what's a decent cluster config to test things out? So far I've been testing with g4dn.2xlarge (32 GB, 1 GPU) on 12.2 LTS ML (GPU), and I'm getting a CUDA out-of-memory error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 492.00 MiB (GPU 0; 14.76 GiB total capacity; 13.52 GiB already allocated; 483.75 MiB free; 13.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
04-15-2023 05:54 PM
@H T: I don't have a specific answer for Dolly right now, but I can give you a framework to think about it, for you to test and try.
To avoid downloading the model every time the cluster is restarted, you can copy the downloaded model files to your Databricks workspace or to a cloud storage account and load them from there instead of from the Hugging Face Hub. You can do this by pointing the model argument at the directory containing the model files (the pipeline expects a model directory, not the pytorch_model.bin file itself):
instruct_pipeline = pipeline(model="/path/to/local/model", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
As for the cluster configuration, it depends on the size of your data and the complexity of your models. For testing purposes, you can start with a smaller instance size and scale up as needed. You can also try the max_split_size_mb setting to work around the CUDA out-of-memory error. Note that this is a PyTorch CUDA allocator option, set through the PYTORCH_CUDA_ALLOC_CONF environment variable rather than passed to the pipeline; blocks larger than this size (in MB) will not be split by the caching allocator, which can help with fragmentation-related OOMs at some performance cost.
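A minimal sketch of setting it (the 512 value is illustrative, not a recommendation; it must take effect before the first CUDA allocation):
import os

# Set before torch initializes CUDA, ideally before importing torch at all
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch
from transformers import pipeline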
05-16-2023 07:02 AM
@Suteja Kanuri -
Hi,
Thanks for responding. I've tried your suggestion, but got an error:
"ValueError: The following `model_kwargs` are not used by the model: ['max_split_size_mb'] (note: typos in the generate arguments will also show up in this list)"
Specifically, I am testing the demo Databricks provided (https://www.dbdemos.ai/, llm-dolly-chatbot), and I get this error in 03-Q&A-prompt-engineering-for-dolly, in the build_qa_chain() function when pipeline is called.
Thoughts?
05-16-2023 08:46 AM
@Suteja Kanuri Update - I was able to get it working by upgrading to a g4dn.12xlarge node (4 GPUs).
However, the code in 02-Data-preparation that applies the sshleifer/distilbart-cnn-12-6 model for a summarization task failed on the more powerful node (while it worked fine with just a single GPU). Do you have any suggestions there?
I set repartition to 4 since there were 4 GPUs. docs_limit_df has 4 rows.
torch.cuda.empty_cache()
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device_map="auto")
docs_limit_df = docs_limit_df.repartition(4).withColumn("text_short", summarize_all("text"))
The error I got was:
"PythonException: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!', from <command-434574176370212>, line 8. Full traceback below:"
04-15-2023 10:25 PM
Hi @H T,
Hope everything is going great.
Just wanted to check in to see if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please let us know so we can help you.
Cheers!
06-02-2023 11:48 AM
Just set the HF cache dir to a persistent path on /dbfs (before transformers is imported, so the setting takes effect):
import os
os.environ['TRANSFORMERS_CACHE'] = "/dbfs/..."
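For example (the cache path below is a placeholder; any persistent /dbfs location works):
import os

# Point the Hugging Face cache at persistent storage so downloads survive restarts
os.environ["TRANSFORMERS_CACHE"] = "/dbfs/tmp/hf_cache"

import torch
from transformers import pipeline

# Later restarts reuse the cached weights instead of re-downloading 23.8 GB
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")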