Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

RuntimeError: Expected to mark a variable ready only once error

saleem_shady
New Contributor

I'm using a single-node machine with a g5.2xlarge instance to fine-tune a LLaMa-2 model. My notebook runs very smoothly on Google Colab, but when I try to run it on Databricks, it throws the exact error given below:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 has been marked as ready twice. This means that multiple auto-grad engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
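As the traceback itself suggests, one way to narrow this down is to set TORCH_DISTRIBUTED_DEBUG before the distributed run starts, so DDP prints the name of the parameter being marked ready twice. A minimal sketch (placing it at the top of the notebook is an assumption):

```python
# Set before torch.distributed / the Trainer initializes the process
# group, so DDP logs parameter names when a hook fires more than once.
import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # "INFO" also works
```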


Here is my code for fine-tuning LLaMa-2, along with the original issue.

1 REPLY

jessysantos
Databricks Employee

Hello @saleem_shady!

Have you tried including the parameter ddp_find_unused_parameters=False in your TrainingArguments? Here's an example of how to include it: https://github.com/databricks/databricks-ml-examples/blob/master/llm-models/llamav2/llamav2-7b/06_fi...
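For reference, here's a minimal sketch of where that parameter goes in a Hugging Face training setup. The output path, batch size, and epoch count below are placeholders, not values from the linked notebook, and `model`/`train_dataset` stand in for the objects already defined in your fine-tuning code; only ddp_find_unused_parameters=False is the suggested fix:

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="/local_disk0/llama2-finetune",  # hypothetical path
    per_device_train_batch_size=4,              # illustrative value
    num_train_epochs=1,                         # illustrative value
    gradient_checkpointing=True,
    # Stops DDP from registering extra autograd hooks to detect unused
    # parameters; those hooks are what trip the "marked as ready twice"
    # check when combined with gradient checkpointing.
    ddp_find_unused_parameters=False,
)

trainer = Trainer(
    model=model,                  # your LLaMa-2 model from the notebook
    args=training_args,
    train_dataset=train_dataset,  # your tokenized dataset
)
trainer.train()
```

Disabling it should be safe here, since every parameter of the model receives a gradient each step; leaving it enabled makes DDP traverse the autograd graph every iteration, which both adds overhead and conflicts with reentrant gradient checkpointing.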

If you have already included this parameter and are still encountering issues, please share the error message you are receiving as a reply to this post.

Best Regards,

Jéssica Santos
