Fuzzy Match on PySpark using UDF/Pandas UDF

mohaimen_syed — Wed, 18 Oct 2023 22:20:49 GMT

I'm trying to do fuzzy matching on two DataFrames by cross joining them and then scoring each pair with a UDF. But with both a Python UDF and a pandas UDF, it's either very slow or I get an error.

@pandas_udf("int")
def core_match_processor(s1: pd.Series, s2: pd.Series) -> pd.Series:
    return pd.Series(int(rapidfuzz.ratio(s1, s2)))

MatchUDF = f.pandas_udf(core_match_processor, returnType=IntegerType())
df0 = df1.crossJoin(broadcast(df2))
df = df0.withColumn("Score", MatchUDF(f.col("String1"), f.col("String2")))

Error: org.apache.spark.SparkRuntimeException: [UDF_USER_CODE_ERROR.GENERIC] Execution of function core_match_processor
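
For readers hitting the same error, here is a minimal sketch of how the snippet above might be restructured. It is not a confirmed fix for this cluster. It assumes three likely problems: the function is wrapped by pandas_udf twice (once by the decorator, once by f.pandas_udf), the scorer lives at rapidfuzz.fuzz.ratio rather than rapidfuzz.ratio, and the scorer compares one pair of strings at a time, so it has to be applied row by row inside each batch instead of to whole Series. The names pair_score, score_batch, and make_match_udf are hypothetical, and the difflib fallback is only a stand-in so the scoring logic can be tried without rapidfuzz installed; its scores can differ from rapidfuzz on partial matches.

```python
from difflib import SequenceMatcher

try:
    # rapidfuzz exposes its scorers in the `fuzz` submodule.
    from rapidfuzz import fuzz

    def pair_score(a, b):
        """Similarity of one pair of strings as an int in 0..100."""
        if a is None or b is None:
            return 0
        return int(fuzz.ratio(a, b))

except ImportError:
    # Stand-in scorer (assumption: only used where rapidfuzz is unavailable).
    def pair_score(a, b):
        """Similarity of one pair of strings as an int in 0..100."""
        if a is None or b is None:
            return 0
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def score_batch(s1, s2):
    """Score two equal-length sequences of strings element by element."""
    return [pair_score(a, b) for a, b in zip(s1, s2)]


def make_match_udf():
    """Build the pandas UDF lazily, so the code above runs without Spark."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Decorate once only. Series -> Series type hints make this a scalar
    # pandas UDF that returns one score per input row.
    @pandas_udf("int")
    def core_match_processor(s1: pd.Series, s2: pd.Series) -> pd.Series:
        return pd.Series(score_batch(s1, s2))

    return core_match_processor
```

With the DataFrames from the post, this would be used as df0.withColumn("Score", make_match_udf()(f.col("String1"), f.col("String2"))). The cross join still produces every pair of rows, so this sketch addresses the error but not necessarily the slowness.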

Re: Fuzzy Match on PySpark using UDF/Pandas UDF

mohaimen_syed — Thu, 19 Oct 2023 22:28:14 GMT

I'm now getting the error: (SQL_GROUPED_AGG_PANDAS_UDF) is not supported on clusters in Shared access mode.
This is even though the following article clearly states that pandas UDFs are supported on shared clusters in Databricks:

https://www.databricks.com/blog/shared-clusters-unity-catalog-win-introducing-cluster-libraries-python-udfs-scala-machine

Re: Fuzzy Match on PySpark using UDF/Pandas UDF

mohaimen_syed — Fri, 20 Oct 2023 18:13:03 GMT

Cluster:
Policy: Shared Compute

Access: Shared

Runtime: 14.1 (includes Apache Spark 3.5.0, Scala 2.12)

Worker type: Standard_L8s_v3 (64 GB Memory, 8 Cores), workers: 1-60
Driver type: Standard_L8s_v3 (64 GB Memory, 8 Cores)

I added this line to my Python notebook:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true"), which I believe will enable the Apache Arrow optimization.

Re: Fuzzy Match on PySpark using UDF/Pandas UDF

thibault — Tue, 26 Mar 2024 10:52:31 GMT

Any updates here? I'm running into the same problem with serverless compute.