
How do I use parallel processing with concurrent jobs in Databricks?

KrishZ
Contributor

Question

It would be great if you could recommend how I go about solving the below problem. I haven't been able to find much help online.

A. Background:

A1. I have to do text manipulation in Python (e.g. concatenation, converting to a spaCy doc, getting verbs from the spaCy doc) for 1 million records.

A2. Each record takes about 1 second, meaning 1 million records would take roughly 11 days!

A3. I am not using any ML model; it's just basic text manipulation in Python.

B. High level problem :

Speed up the above run using the concurrent jobs feature that Databricks offers.

C. I have been recommended the steps below but am unsure how to proceed. Please advise 🙂

C1. I have been recommended to create a table in Databricks for my input data (1 million rows x 5 columns).

C2. Add 2 additional columns to the table: a Result column and a Status column (with values NaN/InProgress/Completed).

C3. Split the table across 10 jobs such that records with Status=NaN are sent for processing (a Python script), and the Status is updated to InProgress/Completed depending on the script's completion for that record.

C4. I have been asked to use a Spark DataFrame in the Python script instead of a pandas DataFrame (a rough sketch of this setup follows below).
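A rough sketch of steps C1-C3, assuming a Databricks notebook where spark is already defined; the source path and the table name text_records are made-up placeholders, not anything from this thread:

from pyspark.sql import functions as F

# C1/C2: register the input data as a Delta table with extra Result and Status columns
input_df = spark.read.parquet("/path/to/input")  # placeholder source location
(input_df
    .withColumn("Result", F.lit(None).cast("string"))
    .withColumn("Status", F.lit("NaN"))
    .write.format("delta").mode("overwrite").saveAsTable("text_records"))

# C3: each job picks up only the rows that have not been processed yet
pending = spark.table("text_records").filter(F.col("Status") == "NaN")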

D. What I have tried already:

D1. I have simply changed my Python code from pandas to pyspark.pandas (the pandas API on Spark, which is supposed to work similarly to a Spark DataFrame). A minimal sketch of this swap is shown after this list.

D2. With the above, I have not seen much improvement: my Python code ran 30% faster for 300 records, but I think that is just because the Databricks (Azure) machine is faster.

D3. For larger record counts, my Databricks notebook gives a pickling error, which I have described in detail in a Stack Overflow question (for which there is unfortunately no answer).
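For reference, a minimal sketch of the swap described in D1; the table and column names are illustrative placeholders:

import pyspark.pandas as ps

psdf = ps.read_table("text_records")      # pandas-like API, but backed by Spark
psdf["ColA"] = psdf["ColB"].str.lower()   # vectorized string operation, no Python loop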

4 REPLIES

-werners-
Esteemed Contributor III

1 million records is really nothing. Spark can handle that without any problem, even on a single worker cluster.

The important thing is to NOT use a loop to iterate over every record (which I suspect you do). When you create a dataframe from your data, you can treat a column as a vector or a set, which can be manipulated in one single go.

For example:

from pyspark.sql.functions import concat
dataframe = dataframe.withColumn("newcol", concat("col1", "col2"))

This will add a column newcol without having to worry about looping, state, etc. A fuller runnable version is sketched below.
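Spelled out as a self-contained cell (the data here is just a placeholder), that looks roughly like:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("HELLO", "WORLD"), ("FOO", "BAR")],  # placeholder data
    ["col1", "col2"],
)

# whole-column operations, no Python-level loop over rows
df = df.withColumn("newcol", F.concat(F.col("col1"), F.lit(" "), F.col("col2")))
df = df.withColumn("newcol_lower", F.lower(F.col("newcol")))
df.show()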

KrishZ
Contributor

Thanks for the reply.

My main function (i.e. looping through 1 million samples for text manipulation) is actually applied to a dataframe directly, using the format below:

df["ColA"] = df["ColB"].apply(my_function)

It's the "my_function" bit that has for loops that cannot be avoided.

For example, take a very simple built-in function like str.lower() on "HELLO WORLD". The source code for that function has for loops, and in my opinion there is no alternative to for loops.

The problem in my case, too, is that loops can't be avoided, as my string manipulation is not always as simple as concatenation. Many times there are regex checks against a list of 400 words (this list is not in the dataframe). Other times it's a big function that requires looping through a tokenized spaCy doc object (to find entities like company names).
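One option the thread itself does not mention (so treat this as an assumption rather than advice from either poster): if parts of my_function really cannot be expressed as column operations, a pandas UDF lets the per-record Python loop run in parallel across partitions instead of on the driver. A minimal sketch, with a stand-in for the real function:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

def my_function(text):
    # stand-in for the real spaCy/regex logic described above
    return text.lower()

@pandas_udf("string")
def my_function_udf(texts: pd.Series) -> pd.Series:
    # runs once per batch; rows within the batch are still looped in Python,
    # but batches are processed in parallel across the cluster
    return texts.apply(my_function)

df = spark.createDataFrame([("HELLO WORLD",), ("FOO BAR",)], ["ColB"])
df = df.withColumn("ColA", my_function_udf(F.col("ColB")))
df.show()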

-werners-
Esteemed Contributor III

I still am not convinced you need loops.

Regex matches can be done in Spark using regexp_replace, regexp_extract, rlike, etc.

Your list of words to check against can also be put in a dataframe.

I agree it does not seem obvious, but the moment you start looping you say goodbye to parallel processing.
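A hedged sketch of both ideas together; the word list and sample text are made up here:

from pyspark.sql import functions as F

# illustrative stand-in for the 400-word list mentioned above
words_df = spark.createDataFrame([("acme",), ("globex",)], ["word"])

df = spark.createDataFrame([("Acme Corp announced earnings",)], ["text"])

# regex checks as column expressions, no Python loop
df = df.withColumn("text_clean", F.regexp_replace(F.lower(F.col("text")), r"[^a-z ]", ""))

# matching against the word list via a join instead of looping over the words
matches = (df.crossJoin(words_df)
             .filter(F.col("text_clean").contains(F.col("word"))))
matches.show()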

Anonymous
Not applicable

Hi @Krishna Zanwar

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
