Machine Learning

How can I use non-Spark libraries like spaCy with Databricks and Spark?

User16752239203
New Contributor

I have an NLP application that I built on my local machine using spaCy and pandas. Now I would like to scale it to a large production dataset and take advantage of Spark's distributed compute. How do I import and use a library like spaCy with Databricks/Spark?

1 REPLY

sean_owen
Honored Contributor II

It depends on what you mean, but if you're just trying to (say) tokenize and process data with spaCy in parallel, that's straightforward. Write a pandas UDF: a function that expresses how you want to transform the data with spaCy, in terms of a pandas DataFrame of input. Then apply that function to your data with Spark; Spark will automatically split your data into pandas DataFrames, apply your function to each one on the workers, and combine the results.
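As a rough illustration of the approach described above, here is a minimal sketch. It uses `DataFrame.mapInPandas`, one of Spark's pandas function APIs, which is just one way to do this; a Series-to-Series `pandas_udf` would also work. The column names `text` and `tokens` and the helper names (`make_tokenizer_fn`, `load_blank_english`, `tokenize_with_spark`) are made up for the example, and it assumes spaCy is already installed on the cluster (for instance with `%pip install spacy` in a notebook). `spacy.blank("en")` gives a tokenizer-only pipeline; a real application would probably load a trained model instead.

```python
import pandas as pd


def make_tokenizer_fn(load_nlp):
    """Build a function with the shape mapInPandas expects:
    an iterator of pandas DataFrames in, an iterator of pandas DataFrames out.

    `load_nlp` is a zero-argument callable that returns a spaCy pipeline
    (hypothetical helper, passed in so the pipeline is created on the worker).
    """
    def tokenize_batches(batches):
        nlp = load_nlp()  # created once per task, not once per row
        for pdf in batches:
            texts = pdf["text"].astype(str).tolist()
            # nlp.pipe processes the texts as a stream rather than one call each
            tokens = [[tok.text for tok in doc] for doc in nlp.pipe(texts)]
            yield pd.DataFrame({"text": texts, "tokens": tokens})
    return tokenize_batches


def load_blank_english():
    # Imported here so spaCy is only needed where the function actually runs.
    import spacy
    return spacy.blank("en")  # tokenizer only; assumption for this sketch


def tokenize_with_spark(sdf):
    """Apply the tokenizer to a Spark DataFrame that has a string column `text`."""
    return sdf.mapInPandas(
        make_tokenizer_fn(load_blank_english),
        schema="text string, tokens array<string>",
    )
```

In a Databricks notebook, where `spark` is already defined, this could be called as `tokenize_with_spark(spark.createDataFrame([("Hello world",)], ["text"]))`. Loading the pipeline inside the function, rather than on the driver, avoids having to ship the spaCy object to the workers.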
