passing array as a parameter to PandasUDF

KNP
New Contributor

Hi Team,

My pandas DataFrame is shown below.

[image]

The raw data is a long series of approximately 5,000 numbers. I need to go through each row in the RawData column and calculate two metrics. I have written a Python function for this, and it works fine.

Python Function

def calculate_metrics(value_dict):
    df = value_dict.copy()
    df['Metric1'] = pd.Series(dtype='float')
    df['Metric2'] = pd.Series(dtype='float')
    for index, row in value_dict.iterrows():
        df.loc[df['Id'] == row['Id'], 'Metric1'] = Function1(row['RawData'])
        df.loc[df['Id'] == row['Id'], 'Metric2'] = Function2(row['RawData'])
    return df

Here I pass a DataFrame, value_dict, with two columns, "Identifier" and "RawData".

I call it as follows:

value_dict['RawData'] = value_dict['RawData'].apply(lambda x: np.array(x))

df_Fullagg = calculate_metrics(value_dict)

This calculates all the metrics I need and returns them in a DataFrame.

The volume of data is quite high, so I want to use the Spark framework with Azure Synapse. How can I write the same function using pandas_udf?

I am looking for an implementation like this:

@pandas_udf('int', PandasUDFType.SCALAR)
def calculate_metrics(value_dict):
    df = value_dict.copy()
    df['Metric1'] = pd.Series(dtype='float')
    df['Metric2'] = pd.Series(dtype='float')
    for index, row in value_dict.iterrows():
        df.loc[df['Id'] == row['Id'], 'Metric1'] = Function1(row['RawData'])
        df.loc[df['Id'] == row['Id'], 'Metric2'] = Function2(row['RawData'])
    return df

Any help would be much appreciated.

2 REPLIES

artsheiko
Valued Contributor III

Hi, it seems you do not need a pandas UDF here. Try the following:

import numpy as np
from pyspark.sql.types import FloatType
from pyspark.sql import functions as f
 
data = [{"Identifier": 123, "RawData": "1,2,4,2,34,6,7,8"},
        {"Identifier": 456, "RawData": "4,5,7,8,9,3,4,7,8"}]
 
df = spark.createDataFrame(data)
 
series_mean = f.udf(lambda x: float(np.mean(x)), FloatType()) # replace by Metric1 logic
series_max = f.udf(lambda x: float(np.max(x)), FloatType()) # replace by Metric2 logic
 
df = (df
      .withColumn("series_int", f.split(f.col('RawData'), ',').cast('array<int>'))
      .withColumn("mean", series_mean("series_int"))
      .withColumn("max", series_max("series_int"))
     )
 
display(df)
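
If a pandas UDF is still preferred, the same per-row logic could be written as a Series-to-Series function. The sketch below is only an illustration under some assumptions: it reuses the "series_int" array column from the snippet above, the names mean_per_row and max_per_row are hypothetical, and np.mean / np.max stand in for the real Metric1 / Metric2 logic. For an array column, each element of the input Series is one row's array, so the function can be checked locally with plain pandas before it is registered on a cluster.

```python
import numpy as np
import pandas as pd


def mean_per_row(raw: pd.Series) -> pd.Series:
    # Each element of `raw` is one row's array; replace np.mean with the Metric1 logic.
    return raw.apply(lambda values: float(np.mean(values)))


def max_per_row(raw: pd.Series) -> pd.Series:
    # Replace np.max with the Metric2 logic.
    return raw.apply(lambda values: float(np.max(values)))


# Local check with plain pandas, using the two sample rows from the reply above.
batch = pd.Series([
    np.array([1, 2, 4, 2, 34, 6, 7, 8]),
    np.array([4, 5, 7, 8, 9, 3, 4, 7, 8]),
])
means = mean_per_row(batch)
maxes = max_per_row(batch)

# On a Spark 3.x cluster, the same functions could be wrapped and applied like this
# (left as comments so the sketch runs without a Spark session):
#
#   from pyspark.sql.functions import pandas_udf
#   mean_udf = pandas_udf(mean_per_row, returnType="double")
#   max_udf = pandas_udf(max_per_row, returnType="double")
#   df = df.withColumn("mean", mean_udf("series_int")).withColumn("max", max_udf("series_int"))
```

Because each function receives a whole batch of rows at once, any part of the metric logic that can be vectorized across the batch is where a pandas UDF is likely to pay off over a plain Python UDF.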

Vidula
Honored Contributor

Hello @Kausthub NP,

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
