Loading Pre-trained Models in Databricks

PSK017
New Contributor

Hello talented members of the community,

I'm a very new Databricks user, so please bear with me.

I'm building a description matcher which uses a pre-trained model (universal-sentence-encoder). How can I load and use this model in my Databricks Python notebook?

I have the zipped model downloaded on my local machine. It contains a .pb file and a variables folder.
 
Just to expand on this, what are the best practices for loading and using pre-trained models in Databricks?
 
Thank you for any help. Greatly appreciate it!
2 REPLIES

Kaniz
Community Manager

Hi @PSK017, let's break down the steps for loading and using the Universal Sentence Encoder (a powerful pre-trained model) in your Databricks Python notebook.

 

Load the Universal Sentence Encoder:

  • First, you’ll need to install the required library. In Databricks, run the following in a notebook cell to install it as a notebook-scoped package:

        %pip install tensorflow_hub

  • Next, you can load the Universal Sentence Encoder using the following code snippet (see below for a variant that loads your local copy of the model):

        import tensorflow_hub as hub

        embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")
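Since you already have the model zipped on your machine: assuming the .pb file is a TensorFlow SavedModel (saved_model.pb), you can unzip it, upload the folder to DBFS, and point hub.load at that local path instead of the TF Hub URL. A minimal sketch; the DBFS path below is hypothetical:

        import tensorflow_hub as hub

        # Hypothetical path: upload the unzipped folder (saved_model.pb plus the
        # variables/ directory) to DBFS first, e.g. via the UI or the Databricks CLI.
        embed = hub.load("/dbfs/FileStore/models/universal-sentence-encoder")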

Compute Sentence Embeddings:

  • Once you’ve loaded the model, you can use it to compute embeddings for sentences. For example:

        sentences = [
            "The quick brown fox jumps over the lazy dog.",
            "I am a sentence for which I would like to get its embedding."
        ]
        embeddings = embed(sentences)
        print(embeddings)

  • The embeddings variable will contain the vector representations of your sentences; a similarity sketch follows below.
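Since you're building a description matcher, a similarity score is the natural next step. A minimal sketch, assuming the embeddings variable from the snippet above; USE embeddings are approximately unit-length, so inner products approximate cosine similarity:

        import numpy as np

        # Pairwise inner products between (approximately unit-length) embeddings.
        emb = embeddings.numpy()
        similarity = np.inner(emb, emb)
        print(similarity)  # similarity[i][j] ~ how similar sentence i is to sentence j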

Best Practices for Using Pre-trained Models in Databricks:

 

Transformers Pipelines: For many NLP tasks, you can use Hugging Face’s Transformers pipelines. These pipelines encapsulate components like tokenizers and models, making it easy to get started. They also handle GPU usage and batching efficiently. Example:

        import torch
        from transformers import pipeline

        summarizer = pipeline("summarization", device=0 if torch.cuda.is_available() else -1)

Distribute Inference on Spark: To scale inference across a cluster, encapsulate the pipeline in a pandas UDF. Spark automatically assigns GPUs on the workers, so the same code runs seamlessly on multi-GPU, multi-machine clusters. A sketch of the pattern follows.
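A minimal sketch of that pattern, assuming Spark 3+ (type-hinted pandas UDFs) and the transformers library installed on the cluster; the DataFrame df and its "text" column are hypothetical. The iterator form loads the model once per executor rather than once per batch:

        from typing import Iterator

        import pandas as pd
        import torch
        from pyspark.sql.functions import pandas_udf

        @pandas_udf("string")
        def summarize_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
            from transformers import pipeline
            # Load the pipeline once per executor, then reuse it for every batch.
            summarizer = pipeline("summarization",
                                  device=0 if torch.cuda.is_available() else -1)
            for batch in batches:
                results = summarizer(batch.tolist(), truncation=True)
                yield pd.Series([r["summary_text"] for r in results])

        # Hypothetical usage: df is a Spark DataFrame with a string column "text".
        summaries = df.withColumn("summary", summarize_udf("text"))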

Fine-tuning (Optional):

  • If you have domain-specific data, consider fine-tuning the pre-trained model on it. Fine-tuning adapts the model to your particular task; a minimal sketch follows below.
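A minimal, hedged sketch of fine-tuning with the Hugging Face Trainer API, using a generic text-classification model (not the Universal Sentence Encoder, which is a TensorFlow Hub model with a different workflow). The model name, labels, and toy data are placeholders, and this assumes the transformers and datasets libraries are installed:

        from datasets import Dataset
        from transformers import (AutoModelForSequenceClassification,
                                  AutoTokenizer, Trainer, TrainingArguments)

        # Toy placeholder data -- substitute your own domain-specific examples.
        data = Dataset.from_dict({
            "text": ["great product, works as described",
                     "arrived broken, very disappointed"],
            "label": [1, 0],
        })

        tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
        data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                                padding="max_length"),
                        batched=True)

        model = AutoModelForSequenceClassification.from_pretrained(
            "distilbert-base-uncased", num_labels=2)
        args = TrainingArguments(output_dir="/tmp/finetuned", num_train_epochs=1,
                                 per_device_train_batch_size=8)
        Trainer(model=model, args=args, train_dataset=data).train()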

Remember that Databricks is an excellent platform for running Hugging Face Transformers, and you can leverage pre-trained models effectively for various NLP tasks. Feel free to explore more and adapt these practices to your specific use case! 🚀🔍

 

For additional best practices related to Databricks, you might also find the blog post on Super Powering Your dbt Project helpful.

Kaniz
Community Manager

Hey there! Thanks a bunch for being part of our awesome community! 🎉 

We love having you around and appreciate all your questions. Take a moment to check out the responses – you'll find some great info. Your input is valuable, so pick the best solution for you. And remember, if you ever need more help, we're here for you!

Keep being awesome! 😊🚀

 
