
How do I use a NumPy case-when condition in pyspark.pandas?

mahesh_vardhan_
New Contributor

I have some legacy pandas code which I want to migrate to Spark to leverage parallelization in Databricks.

I see Databricks has launched a wrapper package on top of pandas which uses pandas nomenclature but runs the Spark engine in the backend.

I am comfortably able to convert my pandas codebase to the Spark version just by replacing my import statement "import pandas as pd" with "import pyspark.pandas as pd".

But the challenge I face is that pandas relies on the NumPy package for case-when conditions, and pyspark.pandas does not currently support working with NumPy.

I just wanted to know: is there a Spark version of NumPy for pyspark.pandas to work with?

Or is there a better alternative approach that I'm missing?

The way I wanted it to work:

tab_tl['Loan_Type'] = np.where(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN', tab_tl['Loan_Type'])

My workaround:

from pyspark.sql.functions import when  # import needed for the case-when below

tab_tl = tab_tl.to_spark()  # converting my wrapper df to a native Spark DataFrame

tab_tl = tab_tl.withColumn("Loan_Type", when(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN').otherwise(tab_tl['Loan_Type']))

tab_tl = pd.DataFrame(tab_tl)  # converting the native Spark DataFrame back to a wrapper df to pass to the next stages

1 ACCEPTED SOLUTION


Anonymous
Not applicable

@mahesh vardhan gandhi:

There is currently no Spark version of NumPy for pyspark.pandas to work with. pyspark.pandas is a new library still in development, so it may not have all the features of pandas or of the other libraries that pandas depends on. Some options to think about:

OPTION 1: Why don't you convert your code to Spark SQL to execute your case-when conditions, something like below:

from pyspark.sql.functions import when
 
# Convert your Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
 
# Register the DataFrame as a temporary view so you can query it using Spark SQL
spark_df.createOrReplaceTempView("my_table")
 
# Execute your case-when condition using Spark SQL; compute the new value under a
# temporary alias so the result doesn't end up with two columns named Loan_Type
result = spark.sql("SELECT *, CASE WHEN Loan_Type = 'AUTO LOAN (PERSONAL)' THEN 'AUTO LOAN' ELSE Loan_Type END AS Loan_Type_new FROM my_table")
result = result.drop("Loan_Type").withColumnRenamed("Loan_Type_new", "Loan_Type")
 
# Convert the resulting Spark DataFrame back to a Pandas DataFrame
result_pandas = result.toPandas()
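A side note beyond the original reply: toPandas() collects the entire result onto the driver as a single in-memory pandas DataFrame, which gives up the parallelism the question is trying to gain. If the goal is to keep using the pandas API while staying distributed, converting back with DataFrame.pandas_api() (Spark 3.2+) may be the better fit, sketched here:

# Keep the result distributed as a pandas-on-Spark DataFrame instead of
# collecting everything onto the driver
result_ps = result.pandas_api()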

OPTION 2: Why don't you try PySpark's built-in functions, such as when, instead of NumPy's where:

from pyspark.sql.functions import when
 
# Convert your Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
 
# Execute your case when condition using PySpark's when function
result = spark_df.withColumn("Loan_Type", when(spark_df['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN').otherwise(spark_df['Loan_Type']))
 
# Convert the resulting Spark DataFrame back to a Pandas DataFrame
result_pandas = result.toPandas()
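Beyond the two options above, the same case-when can also be written directly against the pyspark.pandas wrapper, with no round trip through a native Spark DataFrame. A minimal sketch, assuming pyspark.pandas imports as ps; the sample frame is made up for illustration:

import pyspark.pandas as ps
 
# Made-up example data; in the thread tab_tl comes from real tables
tab_tl = ps.DataFrame({'Loan_Type': ['AUTO LOAN (PERSONAL)', 'HOME LOAN']})
 
# Series.mask replaces values where the condition is True and keeps the
# original value otherwise -- the same semantics as np.where(cond, new, old)
tab_tl['Loan_Type'] = tab_tl['Loan_Type'].mask(
    tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN'
)

For this particular one-for-one substitution, Series.replace('AUTO LOAN (PERSONAL)', 'AUTO LOAN') would be even shorter; mask is shown because it generalizes to arbitrary conditions the way np.where does.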

2 REPLIES

Anonymous
Not applicable

Hi @mahesh vardhan gandhi

Hope all is well! Just wanted to check in on whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
