Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

BUG: Unity Catalog kills UDF

Erik_L
Contributor II

We have UDFs in a few locations, and today we noticed that their performance collapsed. This appears to be caused by Unity Catalog.

Test environment 1:

  • Databricks Runtime Environment: 14.3 / 15.1
  • Compute: 1 driver, 4 worker nodes
  • Policy: Unrestricted
  • Access Mode: Shared

Test environment 2:

  • Databricks Runtime Environment: 14.3 / 15.1
  • Compute: Single Node
  • Policy: Unrestricted
  • Access Mode: Single user

Code:

import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Create test dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
sdf = spark.createDataFrame(df)
sdf.writeTo('test.playground.abcd').createOrReplace()

# Now load from unity catalog and apply UDF:
def squared(x):
    return x * x

squared_udf = F.udf(squared, T.LongType())

sdf_2 = spark.read.table('test.playground.abcd')
sdf_2.withColumn('sq', squared_udf('A')).display()

Performance:

  • Test environment 1: 2 min 55s
  • Test environment 2: 8s
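One mitigation worth testing (my assumption, not something verified in this thread): a plain Python UDF like `squared` is invoked once per row, paying serialization overhead for every value, whereas a pandas UDF exchanges whole column batches via Apache Arrow. A vectorized sketch of the same logic:

```python
import pandas as pd

# Batch version of squared(): operates on an entire pandas Series at once
# instead of one value at a time.
def squared_batch(s: pd.Series) -> pd.Series:
    return s * s

# On a cluster (assumes a live SparkSession `spark`), register it as a
# vectorized UDF -- commented out here because it needs a running cluster:
#
# import pyspark.sql.functions as F
# import pyspark.sql.types as T
# squared_pandas_udf = F.pandas_udf(squared_batch, T.LongType())
# spark.read.table('test.playground.abcd') \
#     .withColumn('sq', squared_pandas_udf('A')).display()
```

If the gap between the two environments shrinks with the pandas UDF, that would point at per-row UDF overhead rather than Unity Catalog metadata lookups.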

Kaniz_Fatma
Community Manager

Hi @Erik_L,

It appears that you’re experiencing performance issues related to Unity Catalog in your Databricks environment.

Let’s explore some potential reasons and solutions:

  1. Mismanagement of Metastores:

    • Unity Catalog, with one metastore per region, is crucial for structured data differentiation across regions.
    • Misconfiguring metastores can lead to operational issues.
    • Databricks’ Unity Catalog addresses challenges associated with traditional metastores like Hive and Glue.

      Recommendation:
      • Stick to one metastore per region and use Databricks-managed Delta Sharing for data sharing across regions.
      • This setup ensures regional data isolation at the catalog level, operational consistency, and strong data governance.
      • Understand the essentials of data governance and tailor a model for your organization.
      • Designate managed storage locations at the catalog and schema levels to enforce data isolation and governance.
      • Use Unity Catalog to bind catalogs to workspaces, allowing data access only in defined areas.
      • Configure privileges and roles in the data structure for precise access control.
  2. Inadequate Access Control and Permissions Configuration:

    • Unity Catalog’s efficient data management relies on accurate roles and access controls.
    • Ensure that you have set up appropriate roles and permissions for users and groups.
    • Regularly review and adjust access controls to maintain security and prevent performance bottlenecks.
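The privilege setup above can be scripted. A minimal sketch (the catalog and table names match the example in the question, but the `data_engineers` group is a hypothetical placeholder):

```python
# Unity Catalog privileges are granted with SQL statements, typically run
# through spark.sql(...) on a cluster. `data_engineers` is a hypothetical
# group; test.playground.abcd is the table from the original example.
grants = [
    "GRANT USE CATALOG ON CATALOG test TO `data_engineers`",
    "GRANT USE SCHEMA ON SCHEMA test.playground TO `data_engineers`",
    "GRANT SELECT ON TABLE test.playground.abcd TO `data_engineers`",
]

# On a cluster (assumes a live SparkSession `spark`):
# for g in grants:
#     spark.sql(g)
# spark.sql("SHOW GRANTS ON TABLE test.playground.abcd").display()
```

`SHOW GRANTS` is a quick way to review existing privileges when auditing access controls as suggested above.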
  3. Bug in Source Data with Volume:

If you continue to experience performance problems, consider reaching out to Databricks support for further assistance. 😊🚀
