Databricks

mohaimen_syed · ‎10-18-2023

I'm trying to do fuzzy matching on two dataframes by cross joining them and then using a udf for my fuzzy matching. But using both python udf and pandas udf its either very slow or I get an error.

@pandas_udf("int")

def core_match_processor(s1: pd.Series, s2: pd.Series) -> pd.Series:

return pd.Series(int(rapidfuzz.ratio(s1, s2)))

MatchUDF = f.pandas_udf(core_match_processor, returnType=IntegerType())

df0 = df1.crossJoin(broadcast(df2))

df = df0.withColumn("Score", MatchUDF(f.col("String1"), f.col("String2")))

Error: org.apache.spark.SparkRuntimeException: [UDF_USER_CODE_ERROR.GENERIC] Execution of function core_match_processor

Kaniz · ‎10-19-2023

Hi @mohaimen_syed, One approach to improving the performance of your fuzzy matching UDF is to use PySpark's built-in String similarity functions, such as levenshtein, soundex, or metaphone. These functions are optimized for distributed processing and can be used directly on PySpark DataFrames without the need for UDFs.

Check out this article for reference:- https://mrpowers.medium.com/fuzzy-matching-in-spark-with-soundex-and-levenshtein-distance-6749f5af8f...

mohaimen_syed · ‎10-19-2023

I'm now getting the error: (SQL_GROUPED_AGG_PANDAS_UDF) is not supported on clusters in Shared access mode.
Even though this article clearly states that pandas udf is supported for shared cluster in databricks

https://www.databricks.com/blog/shared-clusters-unity-catalog-win-introducing-cluster-libraries-pyth...