Fuzzy Match on PySpark using UDF/Pandas UDF

mohaimen_syed
New Contributor III

I'm trying to do fuzzy matching on two dataframes by cross joining them and then using a UDF for the fuzzy matching. But with both a Python UDF and a pandas UDF it's either very slow or I get an error.

 

@pandas_udf("int")
def core_match_processor(s1: pd.Series, s2: pd.Series) -> pd.Series:
return pd.Series(int(rapidfuzz.ratio(s1, s2)))

MatchUDF = f.pandas_udf(core_match_processor, returnType=IntegerType())
 
df0 = df1.crossJoin(broadcast(df2))
df = df0.withColumn("Score", MatchUDF(f.col("String1"), f.col("String2")))
 
Error: org.apache.spark.SparkRuntimeException: [UDF_USER_CODE_ERROR.GENERIC] Execution of function core_match_processor
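For context, a pandas UDF receives whole pd.Series batches, so the rapidfuzz call has to be applied element-wise rather than to the Series objects themselves. A minimal sketch of an element-wise version, assuming rapidfuzz.fuzz.ratio and the same String1/String2 columns (the function name fuzzy_ratio is illustrative, not from the thread):

import pandas as pd
from rapidfuzz import fuzz
from pyspark.sql import functions as f
from pyspark.sql.functions import broadcast, pandas_udf

@pandas_udf("int")
def fuzzy_ratio(s1: pd.Series, s2: pd.Series) -> pd.Series:
    # Apply fuzz.ratio to each pair of strings in the two batches
    return pd.Series([int(fuzz.ratio(a, b)) for a, b in zip(s1, s2)])

df = (
    df1.crossJoin(broadcast(df2))
       .withColumn("Score", fuzzy_ratio(f.col("String1"), f.col("String2")))
)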

Kaniz
Community Manager

Hi @mohaimen_syed, one approach to improving the performance of your fuzzy matching is to use PySpark's built-in string similarity functions, such as levenshtein and soundex. These functions are optimized for distributed processing and can be used directly on PySpark DataFrames without the need for UDFs.

Check out this article for reference:- https://mrpowers.medium.com/fuzzy-matching-in-spark-with-soundex-and-levenshtein-distance-6749f5af8f...
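A minimal sketch of that approach, assuming the same String1/String2 columns; the edit-distance threshold of 3 is an illustrative choice, not from the thread:

from pyspark.sql import functions as f
from pyspark.sql.functions import broadcast

# Cross join and score with built-in functions instead of a UDF
scored = (
    df1.crossJoin(broadcast(df2))
       .withColumn("edit_distance", f.levenshtein(f.col("String1"), f.col("String2")))
       .withColumn("soundex_match", f.soundex(f.col("String1")) == f.soundex(f.col("String2")))
)

# Keep only close matches, e.g. edit distance below the illustrative threshold
matches = scored.filter(f.col("edit_distance") < 3)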

mohaimen_syed
New Contributor III

I'm now getting the error: (SQL_GROUPED_AGG_PANDAS_UDF) is not supported on clusters in Shared access mode.
Even though this article clearly states that pandas UDFs are supported on shared clusters in Databricks:

https://www.databricks.com/blog/shared-clusters-unity-catalog-win-introducing-cluster-libraries-pyth...

Kaniz
Community Manager

Hi @mohaimen_syed, could you please help me with these details:

- Cluster details, and

- Whether Apache Arrow optimization is enabled on your cluster.

mohaimen_syed
New Contributor III

Cluster:
Policy: Shared Compute

Access: Shared

Runtime: 14.1 (includes Apache Spark 3.5.0, Scala 2.12)

Worker type: Standard_L8s_v3 (64 GB Memory, 8 Cores), workers: 1-60
Driver type: Standard_L8s_v3 (64 GB Memory, 8 Cores)

I added this line in my python notebook: 

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true"), which I believe will enable Apache Arrow optimization.
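A quick way to confirm the setting took effect, using the standard Spark conf API:

# Enable Arrow-based columnar data transfers for pandas UDFs
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Verify the current value; should print "true"
print(spark.conf.get("spark.sql.execution.arrow.pyspark.enabled"))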

Any updates here? I'm running into the same problem with serverless compute.
