
Fuzzy Match on PySpark using UDF/Pandas UDF

mohaimen_syed
New Contributor III

I'm trying to do fuzzy matching on two DataFrames by cross joining them and then applying a UDF that computes the match score. But with both a Python UDF and a pandas UDF, it's either very slow or I get an error.

 

@pandas_udf("int")
def core_match_processor(s1: pd.Series, s2: pd.Series) -> pd.Series:
    return pd.Series(int(rapidfuzz.ratio(s1, s2)))

MatchUDF = f.pandas_udf(core_match_processor, returnType=IntegerType())
 
df0 = df1.crossJoin(broadcast(df2))
df = df0.withColumn("Score", MatchUDF(f.col("String1"), f.col("String2")))
 
Error: org.apache.spark.SparkRuntimeException: [UDF_USER_CODE_ERROR.GENERIC] Execution of function core_match_processor
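
A few things in the snippet above look like they could trigger the UDF_USER_CODE_ERROR, though the truncated message does not say which. As far as I know, `rapidfuzz.ratio` is not a top-level function (the scorer lives at `rapidfuzz.fuzz.ratio`). The scorer also compares two strings rather than two Series, and the function is wrapped by `pandas_udf` twice, once by the decorator and once by `f.pandas_udf`. Below is a minimal sketch of a row-wise version, not a verified fix for this cluster. `score_pairs` and `build_match_udf` are hypothetical names, and scoring null inputs as 0 is an assumption.

```python
import pandas as pd


def score_pairs(s1: pd.Series, s2: pd.Series, scorer=None) -> pd.Series:
    """Score each (s1[i], s2[i]) pair and return one int per input row."""
    if scorer is None:
        # Imported lazily so the helper can be exercised without rapidfuzz.
        from rapidfuzz import fuzz
        scorer = fuzz.ratio
    # The scorer compares two strings, so apply it row by row instead of
    # passing the whole Series in at once.
    scores = [
        0 if pd.isna(a) or pd.isna(b) else int(scorer(a, b))  # assumption: nulls score 0
        for a, b in zip(s1, s2)
    ]
    return pd.Series(scores, index=s1.index, dtype="int32")


def build_match_udf():
    """Wrap score_pairs as a Series -> Series pandas UDF, wrapped only once."""
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("int")
    def core_match_processor(s1: pd.Series, s2: pd.Series) -> pd.Series:
        return score_pairs(s1, s2)

    return core_match_processor
```

Assuming the same `df0` as above, it would be applied as `df0.withColumn("Score", build_match_udf()(f.col("String1"), f.col("String2")))`. This only addresses the error; the cross join still produces every pair of rows, so it may remain slow on large inputs.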
3 REPLIES

mohaimen_syed
New Contributor III

I'm now getting the error: (SQL_GROUPED_AGG_PANDAS_UDF) is not supported on clusters in Shared access mode.
This is even though the following article clearly states that pandas UDFs are supported on shared clusters in Databricks:

https://www.databricks.com/blog/shared-clusters-unity-catalog-win-introducing-cluster-libraries-pyth...

Cluster:
Policy: Shared Compute

Access: Shared

Runtime: 14.1 (includes Apache Spark 3.5.0, Scala 2.12)

Worker type: Standard_L8s_v3 (64 GB Memory, 8 Cores), workers: 1-60
Driver type: Standard_L8s_v3 (64 GB Memory, 8 Cores)

I added this line to my Python notebook, which I believe will enable the Apache Arrow optimization:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
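
On the SQL_GROUPED_AGG_PANDAS_UDF message: PySpark picks the pandas UDF variant from the function's type hints. Series to Series hints give a scalar pandas UDF, while Series to a plain scalar such as `int` give a grouped aggregate pandas UDF. One possible explanation, assuming the function's return hint was changed to a scalar while debugging, is that the UDF is now being registered as the grouped aggregate kind, which the error says is not supported in Shared access mode. The sketch below is a small check on the hints before wrapping. `returns_series`, `per_row`, and `per_batch` are hypothetical names, and the check only inspects the annotations, not what Spark does internally.

```python
import typing

import pandas as pd


def returns_series(func) -> bool:
    """Return True if func is annotated to return pd.Series."""
    hints = typing.get_type_hints(func)
    return hints.get("return") is pd.Series


# Series -> Series hints: the shape PySpark treats as a scalar pandas UDF.
def per_row(s1: pd.Series, s2: pd.Series) -> pd.Series:
    return s1.str.len() + s2.str.len()


# Series -> int hints: the shape PySpark treats as a grouped aggregate pandas UDF.
def per_batch(s1: pd.Series, s2: pd.Series) -> int:
    return int(s1.str.len().sum() + s2.str.len().sum())


for fn in (per_row, per_batch):
    kind = "scalar" if returns_series(fn) else "not scalar (check the return hint)"
    print(fn.__name__, "->", kind)
```

If the check flags the function, restoring the `-> pd.Series` return hint and returning one value per input row should bring it back to the scalar variant.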

Any updates here? I'm running into the same problem with serverless compute.
