Anonymous
Not applicable
Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
03-12-2023 05:21 PM
@mahesh vardhan gandhi :
There is no Spark version of NumPy for PySpark Pandas to work with currently. PySpark Pandas is a new library and is still in development, so it may not have all the features of Pandas or other libraries that Pandas depends on. Some options to think about
OPTION 1: Why dont you convert your code Spark SQL to execute your case when conditions something like below
from pyspark.sql.functions import when
# Convert your Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
# Register the DataFrame as a temporary view so you can query it using Spark SQL
spark_df.createOrReplaceTempView("my_table")
# Execute your case when condition using Spark SQL
result = spark.sql("SELECT *, CASE WHEN Loan_Type = 'AUTO LOAN (PERSONAL)' THEN 'AUTO LOAN' ELSE Loan_Type END AS Loan_Type FROM my_table")
# Convert the resulting Spark DataFrame back to a Pandas DataFrame
result_pandas = result.toPandas()OPTION 2: Why dont you try PySpark's built-in functions such as when instead of NumPy's where
from pyspark.sql.functions import when
# Convert your Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
# Execute your case when condition using PySpark's when function
result = spark_df.withColumn("Loan_Type", when(spark_df['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN').otherwise(spark_df['Loan_Type']))
# Convert the resulting Spark DataFrame back to a Pandas DataFrame
result_pandas = result.toPandas()