<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How do I use numpy case when condition in pyspark.pandas? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8389#M4041</link>
    <description>&lt;P&gt;Hi @mahesh vardhan gandhi​&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&amp;nbsp;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 17 Mar 2023 05:30:40 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2023-03-17T05:30:40Z</dc:date>
    <item>
      <title>How do I use numpy case when condition in pyspark.pandas?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8387#M4039</link>
      <description>&lt;P&gt;I have some legacy pandas code that I want to migrate to Spark to leverage parallelization in Databricks.&lt;/P&gt;&lt;P&gt;I see Databricks has launched a wrapper package on top of pandas that uses the pandas nomenclature but runs the Spark engine in the backend.&lt;/P&gt;&lt;P&gt;I am comfortably able to convert my pandas codebase to the Spark version just by replacing my import statement "&lt;B&gt;import pandas as pd&lt;/B&gt;" with "&lt;B&gt;import pyspark.pandas as pd&lt;/B&gt;".&lt;/P&gt;&lt;P&gt;But the challenge I face is that my pandas code relies on the numpy package for case-when conditions, and pyspark.pandas does not currently support working with numpy.&lt;/P&gt;&lt;P&gt;I just wanted to know: is there a Spark version of numpy for pyspark.pandas to work with, or is there a better alternative approach that I'm missing?&lt;/P&gt;&lt;P&gt;&lt;B&gt;The way I wanted it to work:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;tab_tl['Loan_Type'] = np.where(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN', tab_tl['Loan_Type'])&lt;/P&gt;&lt;P&gt;&lt;B&gt;My workaround:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;tab_tl = tab_tl.to_spark()  &lt;B&gt;# converting my wrapper df to a native Spark DataFrame&lt;/B&gt;&lt;/P&gt;&lt;P&gt;tab_tl = tab_tl.withColumn("Loan_Type", when(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN').otherwise(tab_tl['Loan_Type']))&lt;/P&gt;&lt;P&gt;tab_tl = pd.DataFrame(tab_tl)  &lt;B&gt;# converting the native Spark DataFrame back to a wrapper df to pass to the next stages&lt;/B&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2023 08:40:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8387#M4039</guid>
      <dc:creator>mahesh_vardhan_</dc:creator>
      <dc:date>2023-03-02T08:40:23Z</dc:date>
    </item>
    <item>
      <title>Re: How do I use numpy case when condition in pyspark.pandas?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8388#M4040</link>
      <description>&lt;P&gt;@mahesh vardhan gandhi​&amp;nbsp;:&lt;/P&gt;&lt;P&gt;There is no Spark version of NumPy for pyspark.pandas to work with currently. pyspark.pandas is a new library and is still in development, so it may not have all the features of pandas or of the libraries that pandas depends on. Some options to think about:&lt;/P&gt;&lt;P&gt;&lt;B&gt;OPTION 1:&lt;/B&gt; Convert your code to Spark SQL and express the case-when condition there, something like below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Convert your pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

# Register the DataFrame as a temporary view so you can query it using Spark SQL
spark_df.createOrReplaceTempView("my_table")

# Execute the case-when condition using Spark SQL
# (* EXCEPT keeps the result from carrying a duplicate Loan_Type column)
result = spark.sql("""
    SELECT * EXCEPT (Loan_Type),
           CASE WHEN Loan_Type = 'AUTO LOAN (PERSONAL)' THEN 'AUTO LOAN'
                ELSE Loan_Type END AS Loan_Type
    FROM my_table
""")

# Convert the resulting Spark DataFrame back to a pandas DataFrame
result_pandas = result.toPandas()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;B&gt;OPTION 2: &lt;/B&gt;Use PySpark's built-in when function instead of NumPy's where:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import when

# Convert your pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

# Execute the case-when condition using PySpark's when function
result = spark_df.withColumn("Loan_Type", when(spark_df['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN').otherwise(spark_df['Loan_Type']))

# Convert the resulting Spark DataFrame back to a pandas DataFrame
result_pandas = result.toPandas()&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 13 Mar 2023 00:21:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8388#M4040</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-13T00:21:35Z</dc:date>
    </item>
    <item>
      <title>Re: How do I use numpy case when condition in pyspark.pandas?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8389#M4041</link>
      <description>&lt;P&gt;Hi @mahesh vardhan gandhi​&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&amp;nbsp;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 17 Mar 2023 05:30:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8389#M4041</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-17T05:30:40Z</dc:date>
    </item>
  </channel>
</rss>