Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How do I use numpy case when condition in pyspark.pandas?

mahesh_vardhan_
New Contributor

I have some legacy pandas code that I want to migrate to Spark to leverage parallelization in Databricks.

I see Databricks has launched a wrapper package on top of pandas that keeps the pandas nomenclature but uses the Spark engine in the backend.

I can comfortably convert my pandas codebase to the Spark version just by changing my import statement from "import pandas as pd" to "import pyspark.pandas as pd".

The challenge I face is that pandas relies on the numpy package for case-when conditions, and pyspark.pandas does not currently support working with numpy.

I wanted to know whether there is a Spark version of numpy for pyspark.pandas to work with,

or whether there is a better alternative approach that I'm missing.

The way I wanted it to work:

tab_tl['Loan_Type']=np.where(tab_tl['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN',tab_tl['Loan_Type'])

My workaround:

from pyspark.sql.functions import when

tab_tl = tab_tl.to_spark()  # convert my wrapper df to a native Spark DataFrame

tab_tl = tab_tl.withColumn("Loan_Type", when(tab_tl['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN').otherwise(tab_tl['Loan_Type']))

tab_tl = pd.DataFrame(tab_tl)  # convert the native Spark DataFrame back to the wrapper df to pass to the next stages

Accepted Solution

Anonymous
Not applicable

@mahesh vardhan gandhi:

There is currently no Spark version of NumPy for pyspark.pandas to work with. The pandas API on Spark is a relatively new library that is still under development, so it may not have every feature of pandas or of the libraries that pandas depends on. Some options to consider:

OPTION 1: Convert your code to Spark SQL and express the case-when condition there, something like the following:

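# Convert your pandas DataFrame to a Spark DataFrame
# (for a pyspark.pandas DataFrame, use pandas_df.to_spark() instead)
spark_df = spark.createDataFrame(pandas_df)

# Register the DataFrame as a temporary view so you can query it using Spark SQL
spark_df.createOrReplaceTempView("my_table")

# Build the select list explicitly: "SELECT *, ... AS Loan_Type" would return two Loan_Type columns
other_cols = [f"`{c}`" for c in spark_df.columns if c != "Loan_Type"]
case_expr = "CASE WHEN Loan_Type = 'AUTO LOAN (PERSONAL)' THEN 'AUTO LOAN' ELSE Loan_Type END AS Loan_Type"

# Execute your case when condition using Spark SQL (Loan_Type becomes the last column)
result = spark.sql(f"SELECT {', '.join(other_cols + [case_expr])} FROM my_table")

# Convert the resulting Spark DataFrame back to a pandas DataFrame
# (toPandas() collects all rows to the driver; result.pandas_api() keeps the data distributed on Spark 3.2+)
result_pandas = result.toPandas()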

OPTION 2: Use PySpark's built-in functions, such as when, instead of NumPy's where:

from pyspark.sql.functions import when
 
# Convert your pandas DataFrame to a Spark DataFrame
# (for a pyspark.pandas DataFrame, use pandas_df.to_spark() instead)
spark_df = spark.createDataFrame(pandas_df)

# Execute your case when condition using PySpark's when function
result = spark_df.withColumn("Loan_Type", when(spark_df['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN').otherwise(spark_df['Loan_Type']))

# Convert the resulting Spark DataFrame back to a pandas DataFrame
# (toPandas() collects all rows to the driver; result.pandas_api() keeps the data distributed on Spark 3.2+)
result_pandas = result.toPandas()
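
If you would rather stay inside the pandas-style API, Series.mask (or its inverse, Series.where) covers the same single-condition case as np.where, and both methods exist in pandas and in pyspark.pandas. Below is a minimal sketch, assuming the Loan_Type column from the question. The sample rows and the Amount column are made up for illustration, and the sketch imports plain pandas so it can be tried locally. On Databricks, swapping the import for "import pyspark.pandas as pd" should be the only change needed, though behavior may vary across runtime versions.

```python
import pandas as pd  # on Databricks: import pyspark.pandas as pd (assumption: same Series.mask semantics)

# Made-up sample rows, for illustration only
tab_tl = pd.DataFrame({
    "Loan_Type": ["AUTO LOAN (PERSONAL)", "HOME LOAN", "AUTO LOAN"],
    "Amount": [100, 200, 300],
})

# mask(cond, other) replaces values where cond is True and keeps the rest,
# which matches np.where(cond, other, original) for this single-condition case
is_personal_auto = tab_tl["Loan_Type"] == "AUTO LOAN (PERSONAL)"
tab_tl["Loan_Type"] = tab_tl["Loan_Type"].mask(is_personal_auto, "AUTO LOAN")

print(tab_tl)
```

Series.where works the other way around: it keeps values where the condition is True and replaces the rest.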

Anonymous
Not applicable

Hi @mahesh vardhan gandhi,

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
