I have some legacy pandas code that I want to migrate to Spark to leverage parallelization in Databricks.
I see Databricks has launched a wrapper package on top of pandas which uses pandas nomenclature but uses the Spark engine in the backend.
I am able to comfortably convert my pandas codebase to the Spark version just by replacing my import statement from "import pandas as pd" to "import pyspark.pandas as pd".
But the challenge I face is that my pandas code relies on the numpy package for case-when conditions, and pyspark.pandas currently does not support working alongside numpy in this way.
I just wanted to know if there is a Spark equivalent of numpy that works with pyspark.pandas,
or is there a better alternative approach that I'm missing?
The way I wanted it to work:
tab_tl['Loan_Type'] = np.where(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN', tab_tl['Loan_Type'])
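For context, this is just the standard pandas/numpy case-when pattern; a minimal self-contained sketch in plain pandas (the two-row DataFrame is made up for illustration):

import pandas as pd
import numpy as np

tab_tl = pd.DataFrame({'Loan_Type': ['AUTO LOAN (PERSONAL)', 'HOME LOAN']})  # toy data for illustration
# case-when: map 'AUTO LOAN (PERSONAL)' to 'AUTO LOAN', leave every other value unchanged
tab_tl['Loan_Type'] = np.where(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN', tab_tl['Loan_Type'])
print(tab_tl)  # rows come out as 'AUTO LOAN' and 'HOME LOAN'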
My workaround:
from pyspark.sql.functions import when

tab_tl = tab_tl.to_spark()  # converting my wrapper df to a native Spark DataFrame
tab_tl = tab_tl.withColumn('Loan_Type', when(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN').otherwise(tab_tl['Loan_Type']))
tab_tl = pd.DataFrame(tab_tl)  # converting the native Spark DataFrame back to a wrapper df to pass to the next stages
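Put together, the workaround looks like this as a self-contained sketch, assuming tab_tl starts out as a pyspark.pandas DataFrame; the toy data and the explicit when import are my additions for illustration:

import pyspark.pandas as pd
from pyspark.sql.functions import when

tab_tl = pd.DataFrame({'Loan_Type': ['AUTO LOAN (PERSONAL)', 'HOME LOAN']})  # toy data for illustration

sdf = tab_tl.to_spark()  # drop down to a native Spark DataFrame
sdf = sdf.withColumn('Loan_Type',
                     when(sdf['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN')
                     .otherwise(sdf['Loan_Type']))
tab_tl = pd.DataFrame(sdf)  # wrap the Spark DataFrame back into a pyspark.pandas DataFrame for the next stages
# on Spark 3.2+, sdf.pandas_api() should do the same conversion back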