<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: passing array as a parameter to PandasUDF in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/passing-array-as-a-parameter-to-pandasudf/m-p/11219#M6229</link>
    <description>&lt;P&gt;Hello @Kausthub NP,&lt;/P&gt;&lt;P&gt;Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Wed, 07 Sep 2022 11:36:56 GMT</pubDate>
    <dc:creator>Vidula</dc:creator>
    <dc:date>2022-09-07T11:36:56Z</dc:date>
    <item>
      <title>passing array as a parameter to PandasUDF</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-array-as-a-parameter-to-pandasudf/m-p/11217#M6227</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;My pandas DataFrame is shown below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1657iE42F3400FAF3D4CC/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The raw data is a long series of approximately 5000 numbers. I need to go through each row of the RawData column and calculate two metrics. I have written a Python function that does this, and it works fine.&lt;/P&gt;&lt;P&gt;Python function:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;def calculate_metrics(value_dict):
    df = value_dict.copy()
    df['Metric1'] = pd.Series(dtype='float')
    df['Metric2'] = pd.Series(dtype='float')

    for index, row in value_dict.iterrows():
        df.loc[df['Id'] == row['Id'], 'Metric1'] = Function1(row['RawData'])
        df.loc[df['Id'] == row['Id'], 'Metric2'] = Function2(row['RawData'])

    return df&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Here I pass a DataFrame, value_dict, with two columns: "Identifier" and "RawData".&lt;/P&gt;&lt;P&gt;I call it as follows:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;value_dict['RawData'] = value_dict['RawData'].apply(lambda x: np.array(x))
df_Fullagg = calculate_metrics(value_dict)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This calculates all the metrics I need and returns them in a DataFrame.&lt;/P&gt;&lt;P&gt;The volume of data is quite high, so I want to use the Spark framework with Azure Synapse. How can I write the same function using pandas_udf?&lt;/P&gt;&lt;P&gt;I am looking for an implementation like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;@pandas_udf('int', PandasUDFType.SCALAR)
def calculate_metrics(value_dict):
    df = value_dict.copy()
    df['Metric1'] = pd.Series(dtype='float')
    df['Metric2'] = pd.Series(dtype='float')

    for index, row in value_dict.iterrows():
        df.loc[df['Id'] == row['Id'], 'Metric1'] = Function1(row['RawData'])
        df.loc[df['Id'] == row['Id'], 'Metric2'] = Function2(row['RawData'])

    return df&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Any help would be much appreciated.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2022 19:22:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-array-as-a-parameter-to-pandasudf/m-p/11217#M6227</guid>
      <dc:creator>KNP</dc:creator>
      <dc:date>2022-08-04T19:22:51Z</dc:date>
    </item>
    <item>
      <title>Re: passing array as a parameter to PandasUDF</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-array-as-a-parameter-to-pandasudf/m-p/11218#M6228</link>
      <description>&lt;P&gt;Hi, it seems you do not need a pandas UDF here. Try the following:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import numpy as np
from pyspark.sql.types import FloatType
from pyspark.sql import functions as f

# Sample rows: each RawData value is a comma-separated string of numbers
data = [{"Identifier": 123, "RawData": "1,2,4,2,34,6,7,8"},
        {"Identifier": 456, "RawData": "4,5,7,8,9,3,4,7,8"}]

df = spark.createDataFrame(data)

# Plain Python UDFs; each receives one row's array as a Python list
series_mean = f.udf(lambda x: float(np.mean(x)), FloatType())  # replace with Metric1 logic
series_max = f.udf(lambda x: float(np.max(x)), FloatType())  # replace with Metric2 logic

df = (df
      # Split the string on commas and cast the pieces to an array of ints
      .withColumn("series_int", f.split(f.col('RawData'), ',').cast('array&amp;lt;int&amp;gt;'))
      .withColumn("mean", series_mean("series_int"))
      .withColumn("max", series_max("series_int"))
     )

display(df)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 07 Aug 2022 14:45:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-array-as-a-parameter-to-pandasudf/m-p/11218#M6228</guid>
      <dc:creator>artsheiko</dc:creator>
      <dc:date>2022-08-07T14:45:50Z</dc:date>
    </item>
    <item>
      <title>Re: passing array as a parameter to PandasUDF</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-array-as-a-parameter-to-pandasudf/m-p/11219#M6229</link>
      <description>&lt;P&gt;Hello @Kausthub NP,&lt;/P&gt;&lt;P&gt;Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2022 11:36:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-array-as-a-parameter-to-pandasudf/m-p/11219#M6229</guid>
      <dc:creator>Vidula</dc:creator>
      <dc:date>2022-09-07T11:36:56Z</dc:date>
    </item>
  </channel>
</rss>

