<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How do I use numpy case when condition in pyspark.pandas? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8389#M4041</link>
    <description>&lt;P&gt;Hi @mahesh vardhan gandhi​&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&amp;nbsp;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 17 Mar 2023 05:30:40 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2023-03-17T05:30:40Z</dc:date>
    <item>
      <title>How do I use numpy case when condition in pyspark.pandas?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8387#M4039</link>
      <description>&lt;P&gt;I have some legacy pandas code that I want to migrate to Spark to leverage parallelization in Databricks.&lt;/P&gt;&lt;P&gt;I see Databricks has launched a wrapper package on top of pandas that uses the pandas nomenclature but runs the Spark engine in the backend.&lt;/P&gt;&lt;P&gt;I am comfortably able to convert my pandas codebase to the Spark version just by replacing my import statement "&lt;B&gt;import pandas as pd&lt;/B&gt;" with "&lt;B&gt;import pyspark.pandas as pd&lt;/B&gt;".&lt;/P&gt;&lt;P&gt;But the challenge I face is that my pandas code relies on the numpy package for case-when conditions, and pyspark.pandas does not currently support working with numpy.&lt;/P&gt;&lt;P&gt;I just wanted to know: is there a Spark version of numpy for pyspark.pandas to work with, or is there a better alternative approach that I'm missing?&lt;/P&gt;&lt;P&gt;&lt;B&gt;The way I wanted it to work:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;tab_tl['Loan_Type'] = np.where(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN', tab_tl['Loan_Type'])&lt;/P&gt;&lt;P&gt;&lt;B&gt;My workaround:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;tab_tl = tab_tl.to_spark()  &lt;B&gt;# converting my wrapper df to a native Spark DataFrame&lt;/B&gt;&lt;/P&gt;&lt;P&gt;tab_tl = tab_tl.withColumn("Loan_Type", when(tab_tl['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN').otherwise(tab_tl['Loan_Type']))&lt;/P&gt;&lt;P&gt;tab_tl = pd.DataFrame(tab_tl)  &lt;B&gt;# converting the native Spark DataFrame back to a wrapper df to pass to the next stages&lt;/B&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2023 08:40:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8387#M4039</guid>
      <dc:creator>mahesh_vardhan_</dc:creator>
      <dc:date>2023-03-02T08:40:23Z</dc:date>
    </item>
    <item>
      <title>Re: How do I use numpy case when condition in pyspark.pandas?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8388#M4040</link>
      <description>&lt;P&gt;@mahesh vardhan gandhi​&amp;nbsp;:&lt;/P&gt;&lt;P&gt;There is no Spark version of NumPy for pyspark.pandas to work with currently. pyspark.pandas is a new library and is still in development, so it may not have all the features of pandas or of the libraries that pandas depends on. Some options to think about:&lt;/P&gt;&lt;P&gt;&lt;B&gt;OPTION 1:&lt;/B&gt; Convert your code to Spark SQL and express the case-when condition there, something like below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Convert your pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

# Register the DataFrame as a temporary view so you can query it using Spark SQL
spark_df.createOrReplaceTempView("my_table")

# Execute the case-when condition using Spark SQL
# (* EXCEPT keeps the result from carrying a duplicate Loan_Type column)
result = spark.sql("""
    SELECT * EXCEPT (Loan_Type),
           CASE WHEN Loan_Type = 'AUTO LOAN (PERSONAL)' THEN 'AUTO LOAN'
                ELSE Loan_Type END AS Loan_Type
    FROM my_table
""")

# Convert the resulting Spark DataFrame back to a pandas DataFrame
result_pandas = result.toPandas()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;B&gt;OPTION 2: &lt;/B&gt;Use PySpark's built-in when function instead of NumPy's where:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import when

# Convert your pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

# Execute the case-when condition using PySpark's when function
result = spark_df.withColumn("Loan_Type", when(spark_df['Loan_Type'] == 'AUTO LOAN (PERSONAL)', 'AUTO LOAN').otherwise(spark_df['Loan_Type']))

# Convert the resulting Spark DataFrame back to a pandas DataFrame
result_pandas = result.toPandas()&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 13 Mar 2023 00:21:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8388#M4040</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-13T00:21:35Z</dc:date>
    </item>
    <item>
      <title>Re: How do I use numpy case when condition in pyspark.pandas?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8389#M4041</link>
      <description>&lt;P&gt;Hi @mahesh vardhan gandhi​&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&amp;nbsp;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 17 Mar 2023 05:30:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-do-i-use-numpy-case-when-condition-in-pyspark-pandas/m-p/8389#M4041</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-17T05:30:40Z</dc:date>
    </item>
  </channel>
</rss>