It depends on what you mean, but if you're just trying to (say) tokenize and process data with spaCy in parallel, that's straightforward. Write a 'pandas UDF': a function that expresses how you want to transform a chunk of data, handed to you as a pandas Series or DataFrame, using spaCy. Then apply that pandas UDF to your data with Spark; Spark automatically splits your data into pandas-sized chunks, applies your function to each one in parallel across the cluster, and reassembles the results.
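
Here's a minimal sketch of what that can look like, assuming Spark 3.x (type-hinted pandas UDFs) and that the `en_core_web_sm` model is installed on every worker; the names are just illustrative:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

_nlp = None  # cached per Python worker process so the model loads only once


def _get_nlp():
    # Lazy-load spaCy on the worker; loading on the driver and shipping the
    # model over the wire would be slow and may not even serialize cleanly.
    global _nlp
    if _nlp is None:
        import spacy
        _nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    return _nlp


@pandas_udf("array<string>")
def tokenize(texts: pd.Series) -> pd.Series:
    nlp = _get_nlp()
    # nlp.pipe processes the whole chunk as one batch, which is much faster
    # than calling nlp() on each string individually.
    return pd.Series([[tok.text for tok in doc] for doc in nlp.pipe(texts)])


spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Spark and spaCy play well together.",)], ["text"])
df.withColumn("tokens", tokenize("text")).show(truncate=False)
```

The lazy-load-into-a-global pattern matters here: each executor's Python worker gets its own copy of `_nlp`, so you pay the model-loading cost once per worker rather than once per chunk (or worse, once per row).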