
@mahesh vardhan gandhi:

The pandas API on Spark (pyspark.pandas) currently has no Spark-native counterpart to NumPy, so NumPy functions such as np.where cannot be relied on to work on its DataFrames. It is a relatively new library that is still under development, so it may lack some features of pandas or of the libraries pandas depends on. Some options to consider:

OPTION 1: Convert your code to Spark SQL and express the logic as a CASE WHEN condition, something like this:

from pyspark.sql.functions import when
 
# Convert your Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
 
# Register the DataFrame as a temporary view so you can query it using Spark SQL
spark_df.createOrReplaceTempView("my_table")
 
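# Execute your CASE WHEN condition using Spark SQL.
# SELECT * already returns Loan_Type, so the cleaned value gets its own alias
# (Loan_Type_Clean) to avoid a duplicate column name in the result.
result = spark.sql(
    "SELECT *, "
    "CASE WHEN Loan_Type = 'AUTO LOAN (PERSONAL)' THEN 'AUTO LOAN' "
    "ELSE Loan_Type END AS Loan_Type_Clean "
    "FROM my_table"
)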
 
# Convert the resulting Spark DataFrame back to a Pandas DataFrame
result_pandas = result.toPandas()
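
If you want to sanity-check the CASE expression before running it on a cluster, one option is to try the same SQL shape against a small in-memory SQLite table. This is only a rough sketch: SQLite is a stand-in here and is not Spark SQL, and the Customer_Id column, the sample rows, and the Loan_Type_Clean alias are made up for illustration. The distinct alias keeps the output free of a second column named Loan_Type.

```python
import sqlite3

# SQLite stands in for Spark SQL purely to check the CASE logic locally.
# Customer_Id and the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (Customer_Id INTEGER, Loan_Type TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "AUTO LOAN (PERSONAL)"), (2, "HOME LOAN"), (3, None)],
)

query = """
    SELECT *,
           CASE WHEN Loan_Type = 'AUTO LOAN (PERSONAL)' THEN 'AUTO LOAN'
                ELSE Loan_Type
           END AS Loan_Type_Clean
    FROM my_table
    ORDER BY Customer_Id
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

Only the matching value is rewritten; other values, including NULLs, fall through the ELSE branch unchanged.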

OPTION 2: Use PySpark's built-in functions, such as when, instead of NumPy's where:

from pyspark.sql.functions import when
 
# Convert your Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
 
# Execute your case when condition using PySpark's when function
result = spark_df.withColumn("Loan_Type", when(spark_df['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN').otherwise(spark_df['Loan_Type']))
 
# Convert the resulting Spark DataFrame back to a Pandas DataFrame
result_pandas = result.toPandas()
