BUG: Unity Catalog kills UDF

Erik_L
Contributor II

We have UDFs in a few locations, and today we noticed a severe performance regression. It appears to be caused by Unity Catalog.

Test environment 1:

  • Databricks Runtime Environment: 14.3 / 15.1
  • Compute: 1 driver, 4 worker nodes
  • Policy: Unrestricted
  • Access Mode: Shared

Test environment 2:

  • Databricks Runtime Environment: 14.3 / 15.1
  • Compute: Single Node
  • Policy: Unrestricted
  • Access Mode: Single user

Code:

import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Create test dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
sdf = spark.createDataFrame(df)
sdf.writeTo('test.playground.abcd').createOrReplace()

# Now load the table back from Unity Catalog and apply the UDF:
def squared(x):
    return x * x

squared_udf = F.udf(squared, T.LongType())

sdf_2 = spark.read.table('test.playground.abcd')
sdf_2.withColumn('sq', squared_udf('A')).display()

Performance:

  • Test environment 1: 2 min 55s
  • Test environment 2: 8s
1 REPLY

Kaniz
Community Manager

Hi @Erik_L,

It appears that you’re experiencing performance issues related to Unity Catalog in your Databricks environment.

Let’s explore some potential reasons and solutions:

  1. Mismanagement of Metastores:

    • Unity Catalog, with one metastore per region, is crucial for structured data differentiation across regions.
    • Misconfiguring metastores can lead to operational issues.
    • Databricks’ Unity Catalog addresses challenges associated with traditional metastores like Hive and Glue.

      Recommendation:
      • Stick to one metastore per region and use Databricks-managed Delta Sharing for data sharing across regions.
      • This setup ensures regional data isolation at the catalog level, operational consistency, and strong data governance.
      • Understand the essentials of data governance and tailor a model for your organization.
      • Designate managed storage locations at the catalog and schema levels to enforce data isolation and governance.
      • Use Unity Catalog to bind catalogs to workspaces, allowing data access only in defined areas.
      • Configure privileges and roles in the data structure for precise access control.
  2. Inadequate Access Control and Permissions Configuration:

    • Unity Catalog’s efficient data management relies on accurate roles and access controls.
    • Ensure that you have set up appropriate roles and permissions for users and groups.
    • Regularly review and adjust access controls to maintain security and prevent performance bottlenecks.
  3. Bug in Source Data with Volume:

If you continue to experience performance problems, consider reaching out to Databricks support for further assistance. 😊🚀