How do I use numpy case when condition in pyspark....

mahesh_vardhan_ · 03-02-2023

I have some legacy pandas code that I want to migrate to Spark to leverage parallelization in Databricks.

I see Databricks has launched a wrapper package on top of pandas that uses pandas nomenclature but runs on the Spark engine in the backend.

I am comfortably able to convert my pandas codebase to the Spark version just by changing my import statement from "import pandas as pd" to "import pyspark.pandas as pd".

But the challenge I face is that pandas relies on the numpy package for case-when conditions, and pyspark.pandas currently does not support working with numpy.

I just wanted to know if there is a Spark version of numpy that pyspark.pandas can work with.

Or is there a better alternative approach that I'm missing?

The way I wanted it to work:

tab_tl['Loan_Type']=np.where(tab_tl['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN',tab_tl['Loan_Type'])

My workaround:

from pyspark.sql.functions import when

tab_tl = tab_tl.to_spark()  # convert the wrapper df to a native Spark DataFrame

tab_tl = tab_tl.withColumn("Loan_Type", when(tab_tl['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN').otherwise(tab_tl['Loan_Type']))

tab_tl = pd.DataFrame(tab_tl)  # convert the native Spark DataFrame back to the wrapper df to pass to the next stages

How do I use numpy case when condition in pyspark.pandas?
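
One possible way to avoid the round trip through a native Spark DataFrame is to express the case-when with the pandas method Series.mask instead of np.where. The sketch below runs on plain pandas with hypothetical sample data standing in for tab_tl. pyspark.pandas also documents Series.mask, so the assumption here is that the same lines carry over once the import is changed to "import pyspark.pandas as pd"; that has not been verified on a Databricks cluster.

```python
import pandas as pd  # assumption: on Databricks this would be `import pyspark.pandas as pd`

# Hypothetical sample data standing in for the real tab_tl
tab_tl = pd.DataFrame(
    {"Loan_Type": ["AUTO LOAN (PERSONAL)", "HOME LOAN", "AUTO LOAN"]}
)

# Same intent as np.where(cond, 'AUTO LOAN', tab_tl['Loan_Type']):
# mask() replaces the values where cond is True and keeps all others
cond = tab_tl["Loan_Type"] == "AUTO LOAN (PERSONAL)"
tab_tl["Loan_Type"] = tab_tl["Loan_Type"].mask(cond, "AUTO LOAN")

result = tab_tl["Loan_Type"].tolist()
print(result)
```

For a single exact-match substitution like this one, Series.replace with a one-entry dict may be an even shorter option, while mask generalizes to arbitrary boolean conditions.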