03-02-2023 12:40 AM
I have some legacy pandas code that I want to migrate to Spark to leverage parallelization in Databricks.
I see Databricks has launched a wrapper package on top of pandas that uses pandas nomenclature but runs on the Spark engine in the backend.
I am comfortably able to convert my pandas codebase to the Spark version just by replacing my import statement "import pandas as pd" with "import pyspark.pandas as pd".
The challenge I face is that my pandas code relies on the numpy package for case-when conditions, and pyspark.pandas does not currently support working with numpy.
Is there a Spark version of numpy that pyspark.pandas can work with?
Or is there a better alternative approach that I'm missing?
The way I wanted it to work:
tab_tl['Loan_Type']=np.where(tab_tl['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN',tab_tl['Loan_Type'])
My workaround:
from pyspark.sql.functions import when
tab_tl = tab_tl.to_spark()  # convert my wrapper df to a native Spark DataFrame
tab_tl = tab_tl.withColumn("Loan_Type", when(tab_tl['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN').otherwise(tab_tl['Loan_Type']))
tab_tl = pd.DataFrame(tab_tl)  # convert the native Spark DataFrame back to a wrapper df to pass to the next stages
Labels: Condition, Import Pandas, Pandas, Pyspark.pandas
Accepted Solutions
03-12-2023 05:21 PM
@mahesh vardhan gandhi:
There is currently no Spark version of NumPy that works with pyspark.pandas. pyspark.pandas is a relatively new library and is still in development, so it may not have all the features of pandas or of the libraries that pandas depends on. Some options to consider:
OPTION 1: Why not use Spark SQL to express your CASE WHEN conditions, something like the following:
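# Convert your pandas DataFrame to a Spark DataFrame
# (for a pyspark.pandas DataFrame, call its to_spark() method instead, as in the question)
spark_df = spark.createDataFrame(pandas_df)
# Register the DataFrame as a temporary view so you can query it using Spark SQL
spark_df.createOrReplaceTempView("my_table")
# Apply the CASE WHEN condition in Spark SQL. SELECT * already returns Loan_Type,
# so the recoded values go into a temporary column (the name Loan_Type_New is arbitrary)
result = spark.sql("SELECT *, CASE WHEN Loan_Type = 'AUTO LOAN (PERSONAL)' THEN 'AUTO LOAN' ELSE Loan_Type END AS Loan_Type_New FROM my_table")
# Replace the original column with the recoded one
result = result.drop("Loan_Type").withColumnRenamed("Loan_Type_New", "Loan_Type")
# Convert the resulting Spark DataFrame back to a pandas DataFrame (toPandas() collects all rows to the driver)
result_pandas = result.toPandas()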
OPTION 2: Why not try PySpark's built-in functions, such as when, instead of NumPy's where:
from pyspark.sql.functions import when
# Convert your pandas DataFrame to a Spark DataFrame
# (for a pyspark.pandas DataFrame, call its to_spark() method instead, as in the question)
spark_df = spark.createDataFrame(pandas_df)
# Apply the case-when condition using PySpark's when function
result = spark_df.withColumn("Loan_Type", when(spark_df['Loan_Type']=='AUTO LOAN (PERSONAL)','AUTO LOAN').otherwise(spark_df['Loan_Type']))
# Convert the resulting Spark DataFrame back to a pandas DataFrame (toPandas() collects all rows to the driver)
result_pandas = result.toPandas()
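A possible third route, not part of the original answer, is to stay inside the pandas-style API and replace np.where with Series.mask, which exists in pandas and is also exposed by pyspark.pandas. The sketch below runs on plain pandas so it can be tried anywhere; the assumption is that swapping the import for "import pyspark.pandas as pd" on Databricks gives the same result. The helper name relabel_auto_loans and the sample rows are hypothetical.

```python
import pandas as pd  # assumption: on Databricks, swap for "import pyspark.pandas as pd"


def relabel_auto_loans(df):
    """Return a copy of df with 'AUTO LOAN (PERSONAL)' recoded to 'AUTO LOAN'.

    Hypothetical helper: uses Series.mask in place of np.where, so no NumPy
    call touches the DataFrame.
    """
    out = df.copy()
    # Boolean Series marking the rows to recode
    is_personal_auto = out["Loan_Type"] == "AUTO LOAN (PERSONAL)"
    # mask() replaces values where the condition is True and keeps the rest
    out["Loan_Type"] = out["Loan_Type"].mask(is_personal_auto, "AUTO LOAN")
    return out


# Made-up sample rows, for illustration only
tab_tl = pd.DataFrame(
    {"Loan_Type": ["AUTO LOAN (PERSONAL)", "HOME LOAN", "AUTO LOAN"]}
)
tab_tl = relabel_auto_loans(tab_tl)
print(tab_tl["Loan_Type"].tolist())
```

If this behaves the same way on a pyspark.pandas DataFrame, it avoids the round trip through to_spark() and back, and nothing is collected to the driver.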
03-16-2023 10:30 PM
Hi @mahesh vardhan gandhi
Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!