
DLT Performance question with Unity Catalog

harvey-c
New Contributor III

Dear Community Members

This question is about debugging a performance issue in a DLT pipeline with Unity Catalog.

I had a DLT pipeline in Azure Databricks running against the local store, i.e. hive_metastore, and the process took about 2 hours with the autoscaling cluster configuration. To better manage the data, I refactored the process to use Unity Catalog by replacing the mount with a volume and pointing the target to a Unity Catalog schema. I was surprised to find that the new process took more than 6 hours.

Could you please let me know whether some overhead is expected when using Unity Catalog with DLT? For a normal job I can find more information in the Spark UI, but where can I find more information on the DLT process to debug the performance issue?

Thank you! 

5 REPLIES

Kaniz_Fatma
Community Manager

Hi @harvey-c, Certainly! Let’s explore how you can monitor and log row counts in your Delta Live Tables (DLT) pipelines.

 

Monitoring DLT Pipelines:

DLT provides built-in features for monitoring and observability. You can review most monitoring data manually through the pipeline details UI.

Here are some key aspects to consider:

  • Pipeline Graph: The pipeline graph displays dependencies between datasets in your pipeline. By default, it shows the most recent update for the table. You can select older updates from a drop-down menu. Details displayed include pipeline ID, source libraries, compute cost, product edition, Databricks Runtime version, and the channel configured for the pipeline.
  • List View: Click the “List” tab to see all datasets in your pipeline represented as rows in a table. This view is useful when your pipeline DAG is too large to visualize in the graph view.
  • Dataset Details: Clicking on a dataset in the pipeline graph or dataset list displays schema information, data quality metrics, and a link back to the source code defining the dataset.
  • Update History: To view the history and status of pipeline updates, use the update history drop-down menu. You can see the graph, details, and events for a specific update.
  • Real-time Notifications: Set up email notifications for pipeline events, such as successful completion or failure of an update.

Row Count Validation:

You can add an additional table to your pipeline that defines an expectation to compare row counts between two live tables.

The results of this expectation appear in the event log and the DLT UI.

For example, to validate equal row counts between tables tbla and tblb, you can define an expectatio...:

  -- Validate row counts across tables
  SELECT COUNT(*) AS row_count_diff FROM tbla
  UNION ALL
  SELECT -COUNT(*) AS row_count_diff FROM tblb

Custom Logging:

If you need more customized logging, you can create your own logging statements within your DLT pipeline code.

For example, in Python, you can use the logging module to log messages. Here’s a snippet:

  import logging
  from datetime import datetime

  logger = logging.getLogger("raw_zone")
  logger.info("Processing of landing to Raw layer has started {0}".format(datetime.now()))

Adjust the logger name and log messages as needed for your specific use case.
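
For performance questions in particular, it can help to log how long a step takes rather than only when it starts. The following is a minimal sketch using only the Python standard library; the `timed` helper and the step name are hypothetical, not part of DLT. Note that `logger.info` messages are hidden under Python's default WARNING level unless logging is configured, and that code inside a DLT dataset definition may run when the pipeline graph is built rather than when data is processed, so this kind of timing is probably most informative around eager steps such as listing source files.

```python
import logging
import time
from contextlib import contextmanager

# Without this, the root logger's default WARNING level hides info() messages.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("raw_zone")  # same logger name as in the snippet above


@contextmanager
def timed(step_name):
    """Log the start of a step and its elapsed wall-clock time when it ends."""
    start = time.perf_counter()
    logger.info("%s started", step_name)
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s finished in %.2f s", step_name, elapsed)


# Usage: wrap any eager step; the sum() call is only a stand-in for real work.
with timed("landing to raw"):
    sum(range(1_000_000))
```

The `finally` clause means the duration is still logged if the wrapped step raises an exception.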

 

Remember that the Delta Live Tables event log contains all information related to your pipeline, including audit logs, data quality checks, pipeline progress, and data lineage. You can use this event log to track, understand, and monitor the state of your data pipelines. Happy debugging! 🚀
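
As a rough illustration of using the event log for performance questions, the sketch below computes how long each flow ran from a handful of event rows. The sample rows, the flow names, and the field layout (`event_type`, `origin.flow_name`, `timestamp`, and a JSON `details` string with `flow_progress.status`) are assumptions made for the example; check them against the actual event log of your pipeline before relying on this.

```python
import json
from datetime import datetime

# Hypothetical sample rows shaped like flow_progress events.
# Field names and status values are assumptions, not a confirmed schema.
SAMPLE_EVENTS = [
    {"event_type": "flow_progress", "timestamp": "2024-01-10T01:00:00",
     "origin": {"flow_name": "raw_orders"},
     "details": json.dumps({"flow_progress": {"status": "RUNNING"}})},
    {"event_type": "flow_progress", "timestamp": "2024-01-10T01:05:00",
     "origin": {"flow_name": "raw_orders"},
     "details": json.dumps({"flow_progress": {"status": "COMPLETED"}})},
    {"event_type": "flow_progress", "timestamp": "2024-01-10T01:05:00",
     "origin": {"flow_name": "clean_orders"},
     "details": json.dumps({"flow_progress": {"status": "RUNNING"}})},
    {"event_type": "flow_progress", "timestamp": "2024-01-10T06:05:00",
     "origin": {"flow_name": "clean_orders"},
     "details": json.dumps({"flow_progress": {"status": "COMPLETED"}})},
    {"event_type": "update_progress", "timestamp": "2024-01-10T06:05:01",
     "origin": {}, "details": json.dumps({})},
]


def flow_durations(events):
    """Return seconds from the earliest RUNNING to the latest COMPLETED event per flow."""
    started, finished = {}, {}
    for event in events:
        if event.get("event_type") != "flow_progress":
            continue  # ignore events that do not describe a flow
        flow = event["origin"]["flow_name"]
        status = json.loads(event["details"])["flow_progress"]["status"]
        ts = datetime.fromisoformat(event["timestamp"])
        if status == "RUNNING":
            started[flow] = min(ts, started.get(flow, ts))
        elif status == "COMPLETED":
            finished[flow] = max(ts, finished.get(flow, ts))
    return {
        flow: (finished[flow] - started[flow]).total_seconds()
        for flow in started
        if flow in finished
    }


# Print flows from slowest to fastest.
for name, seconds in sorted(flow_durations(SAMPLE_EVENTS).items(),
                            key=lambda item: item[1], reverse=True):
    print(f"{name}: {seconds:.0f} s")
```

Sorting by duration makes it easier to see which flow dominates a long update, which is usually the first thing to establish before looking at cluster metrics.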

harvey-c
New Contributor III

Thank you, Kaniz, for the information. However, it did not address the key point of the question, i.e. how to debug a DLT performance issue when the same code gives different results with and without Unity Catalog. From the UI I can see the outcome of the execution, i.e. that the pipeline ran for 6 hours, but no further details such as time spent waiting for connections, batch size, number of files processed per second, or cluster resource utilisation. This information could be found in the Spark UI or driver log if the notebooks were executed without DLT.

Could you please point me to more detailed technical insight into DLT pipeline execution?

Thank you!

harvey-c
New Contributor III

After more investigation in the Spark UI, I found that some jobs, such as "_materialization_mat_xxxxx id start at PhysicalFlow.scala:319", can take a significant time, e.g. hours, before other jobs execute. Is there any configuration that can be set to manage this behaviour?

Thanks. 

Mystagon
New Contributor II

Hey Harvey, I'm getting roughly the same performance problems as you:

From around 25 minutes in a normal workspace to 1 hour and 20 minutes in a UC workspace, which is roughly 3x slower.

Did you manage to solve this? I've also noticed that dbutils.fs.ls() is around 4-8x slower as well.
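
In case it helps anyone reproduce the listing slowdown, here is a minimal sketch for measuring it. The `median_seconds` helper is hypothetical, and `os.listdir` is only a local stand-in so the snippet runs anywhere; in a Databricks notebook the assumption is that you would pass `dbutils.fs.ls` together with the mount path and the volume path you want to compare.

```python
import os
import statistics
import time


def median_seconds(list_fn, path, repeats=5):
    """Call list_fn(path) `repeats` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        list_fn(path)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# os.listdir is a stand-in; swap in dbutils.fs.ls and your own paths on Databricks.
baseline = median_seconds(os.listdir, ".")
print(f"median listing time: {baseline:.6f} s")
```

Using the median of several calls instead of a single measurement reduces the effect of one-off delays such as a cold cache.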

harvey-c
New Contributor III

Hi, Mystagon

After some calls with the Databricks support team, it turned out that there was a bug when reading source data from a Volume. The workaround is to create the source without a volume.

The bug may be fixed in Feb 2024.

I hope this helps.

Harvey 
