Question
It would be great if you could recommend how to solve the problem below. I haven't been able to find much help online.
A. Background:
A1. I have to do text manipulation in Python (concatenation, converting to a spaCy doc, extracting verbs from the spaCy doc, etc.) for 1 million records
A2. Each record takes ~1 second, so 1 million records would take roughly 11–12 days (10^6 seconds)!
A3. There's no ML model involved; it's just basic text manipulation in Python
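For context, the per-record work described in A1 looks roughly like the following. All column and function names here are hypothetical, and the verb extraction is stubbed with a regex so the sketch runs without spaCy (the real code would call `nlp(text)` and filter on `token.pos_ == "VERB"`):

```python
import re

def process_record(record: dict) -> str:
    """Hypothetical per-record text manipulation: concatenate two text
    columns, tokenize, and keep 'verb-like' tokens. A crude regex
    tokenizer stands in for spaCy so the sketch is self-contained."""
    text = f"{record['title']} {record['body']}"      # concatenation
    tokens = re.findall(r"[a-z]+", text.lower())      # crude tokenizer
    # stand-in for spaCy POS tagging: pretend -ing words are verbs
    verbs = [t for t in tokens if t.endswith("ing")]
    return " ".join(verbs)

row = {"title": "Running analysis", "body": "The job is processing records"}
print(process_record(row))  # running processing
```

One note on the ~1 s/record figure: if each record runs the full spaCy pipeline separately, most of that time is likely spaCy overhead. spaCy's `nlp.pipe(texts)` processes documents in batches and is usually much faster than calling `nlp(text)` once per record, independent of any Spark parallelism.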
B. High level problem :
Speed up the above run using the concurrent jobs that Databricks offers.
C. I have been recommended the steps below but am unsure how to proceed. Please advise 🙂
C1. I have been recommended to create a table in Databricks for my input data (1 million rows x 5 columns).
C2. Add 2 additional columns to the table: a Result column and a Status column (with values NaN/InProgress/Completed)
C3. Split the table into 10 jobs such that records with Status=NaN are sent for processing (by the Python script), and the Status is updated to InProgress/Completed depending on the script's progress for that record.
C4. I have been asked to use a Spark DataFrame in the Python script instead of a pandas DataFrame.
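The C1–C4 flow can be sketched in PySpark. This is a minimal sketch, not a definitive implementation: the table name `input_table` and columns `title`/`body` are assumptions, the per-batch logic is plain pandas (so it can be tried without a cluster), and the Spark wiring is confined to `run_job`, which expects an existing `SparkSession`:

```python
import pandas as pd

def process_batch(texts: pd.Series) -> pd.Series:
    # hypothetical stand-in for the real spaCy logic: uppercase the text
    return texts.str.upper()

def run_job(spark, table_name: str = "input_table"):
    """Sketch of C1-C4: read the table, process only Status == 'NaN'
    rows with a pandas UDF, and write Result/Status back out.
    Spark imports are deferred so the pure-pandas part above is
    testable on its own."""
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def process_udf(texts: pd.Series) -> pd.Series:
        return process_batch(texts)

    df = spark.table(table_name)
    pending = df.filter(F.col("Status") == "NaN")
    done = (pending
            .withColumn("Result",
                        process_udf(F.concat_ws(" ", "title", "body")))
            .withColumn("Status", F.lit("Completed")))
    done.write.mode("append").saveAsTable(table_name + "_results")
```

One design note: with a pandas UDF, Spark already shards the rows across executors, so the manual "split into 10 jobs" in C3 may be unnecessary; `df.repartition(n)` controls the degree of parallelism instead, and the Status column then mainly serves as a checkpoint for restartability.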
D. What I have tried already:
D1. I have simply changed my Python code from pandas to `pyspark.pandas` (the pandas API on Spark, which is supposed to behave like a Spark DataFrame)
D2. But with the above, I have not seen much improvement: my code ran ~30% faster for 300 records, but I suspect that's just down to a faster Databricks (Azure) VM
D3. For larger batches, my Databricks notebook raises a PicklingError, which I have described in detail in a Stack Overflow question (unanswered, unfortunately)
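Regarding D3: a common cause of a PicklingError in this kind of setup is capturing the spaCy `Language` object in a closure that Spark then tries to serialize and ship to executors. The usual workaround is to load the model lazily inside the worker rather than on the driver. Below is a minimal sketch of that lazy-init pattern, written with a generic `loader` callable so it runs without spaCy; in real code `loader` would be something like `lambda: spacy.load("en_core_web_sm")`:

```python
_MODEL = None

def get_model(loader):
    """Load the model once per worker process instead of pickling it.
    Only the small, picklable `loader` function crosses the
    driver/worker boundary; the heavy model object is created on
    first use inside the executor and cached in a module global."""
    global _MODEL
    if _MODEL is None:
        _MODEL = loader()
    return _MODEL

# Inside a Spark pandas UDF this would look roughly like:
#   @pandas_udf("string")
#   def extract_verbs(texts):
#       nlp = get_model(lambda: spacy.load("en_core_web_sm"))
#       docs = nlp.pipe(texts.tolist())
#       return pd.Series(" ".join(t.text for t in d if t.pos_ == "VERB")
#                        for d in docs)

calls = []
def fake_loader():
    calls.append(1)
    return "model"

print(get_model(fake_loader), get_model(fake_loader), len(calls))
# model model 1  -> the loader ran only once
```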