12-03-2023 09:39 PM
Dear Community Members,
This question is about debugging a performance issue in a DLT pipeline with Unity Catalog.
I had a DLT pipeline in Azure Databricks running on the local store, i.e. hive_metastore, and the process took about 2 hours with an autoscaling cluster configuration. To better manage the data, I refactored the process to use Unity Catalog, replacing the mount with a volume and targeting a Unity Catalog schema. I was surprised to find the new process took more than 6 hours.
Could you please let me know if there is expected overhead when using Unity Catalog with DLT? For a normal job I can find more information in the Spark UI, but where can I find details on the DLT process to debug the performance issue?
Thank you!
12-04-2023 01:43 AM
Hi @harvey-c! Certainly, let's explore how you can monitor and log row counts in your Delta Live Tables (DLT) pipelines.
Monitoring DLT Pipelines:
DLT provides built-in features for monitoring and observability. You can review most monitoring data manually through the pipeline details UI.
Here are some key aspects to consider:
Row Count Validation:
You can add an additional table to your pipeline that defines an expectation to compare row counts between two live tables.
The results of this expectation appear in the event log and the DLT UI.
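For illustration, here is a minimal sketch of that pattern; the table names (orders_raw, orders_cleaned) and the expectation are hypothetical and should be adapted to your pipeline:
```python
import dlt

# Sketch: a validation table that compares row counts between two live
# tables. The expectation result appears in the event log and the DLT UI.
@dlt.table(name="row_count_check")
@dlt.expect_or_fail("counts_match", "raw_count == cleaned_count")
def row_count_check():
    return spark.sql("""
        SELECT
          (SELECT COUNT(*) FROM LIVE.orders_raw) AS raw_count,
          (SELECT COUNT(*) FROM LIVE.orders_cleaned) AS cleaned_count
    """)
```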
Custom Logging:
If you need more customized logging, you can create your own logging statements within your DLT pipeline code.
For example, in Python, you can use the logging module to log messages. Here's a minimal sketch (the logger name and table names below are placeholders):
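```python
import logging

import dlt

# Placeholder logger name; adjust for your pipeline.
logger = logging.getLogger("my_dlt_pipeline")

@dlt.table(name="orders_enriched")
def orders_enriched():
    # "orders_raw" is a hypothetical upstream live table.
    df = dlt.read("orders_raw")
    # Note: count() triggers an extra pass over the data, so use sparingly.
    logger.info("orders_raw row count: %d", df.count())
    return df
```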
Adjust the logger name and log messages as needed for your specific use case.
Remember that the Delta Live Tables event log contains all information related to your pipeline, including audit logs, data quality checks, pipeline progress, and data lineage. You can use this event log to track, understand, and monitor the state of your data pipelines. Happy debugging!
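If you want to dig into the event log programmatically rather than through the UI, here is a sketch. It assumes a Unity Catalog-enabled pipeline and uses a placeholder pipeline ID; for hive_metastore pipelines, the log is a Delta table under the pipeline's storage location:
```python
# Sketch: querying the DLT event log with the event_log() table-valued
# function (UC pipelines). Replace the pipeline ID with your own.
events = spark.sql("""
    SELECT timestamp, event_type, message
    FROM event_log('<your-pipeline-id>')
    ORDER BY timestamp DESC
""")
events.show(truncate=False)

# For hive_metastore pipelines, the event log is a Delta table under the
# pipeline's storage location (the default path below is an assumption):
# events = spark.read.format("delta").load(
#     "dbfs:/pipelines/<pipeline-id>/system/events")
```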
12-04-2023 01:10 PM
Thank you Kaniz for the information; however, it did not address the key point of the question, i.e. how to debug the performance issue of DLT, given the same code, with different results with and without Unity Catalog. From the UI I can find the outcome of the execution, i.e. the pipeline ran 6 hours, but no further details such as waiting for connections, batch sizes, files processed per second, cluster resource utilisation, etc. This information can be found in the Spark UI and driver log when the notebooks are executed without DLT.
Could you please point me to more detailed technical insight into DLT pipeline execution?
Thank you!
12-06-2023 10:26 PM
After more investigation in the Spark UI, I found that some jobs like "_materialization_mat_xxxxx id start at PhysicalFlow.scala:319" may take significant time, e.g. hours, before other jobs execute. Are there any configurations that can be set to manage this behaviour?
Thanks.
01-23-2024 02:20 AM
Hey Harvey, I'm running into much the same performance problems as you:
From around 25 minutes in a normal workspace to 1 hour and 20 minutes in a UC workspace, which is roughly 3x slower.
Did you manage to solve this? I've also noticed dbutils.fs.ls() is around 4-8x slower as well.
01-23-2024 09:04 PM
Hi Mystagon,
After some calls with the Databricks support team, it was found that there was a bug affecting source data accessed via a Volume. The workaround is to create the source without the volume.
The bug may be fixed in Feb 2024.
I hope this helps.
Harvey