
DLT Performance question with Unity Catalog

harvey-c
New Contributor III

Dear Community Members

This question is about debugging a performance issue in a DLT pipeline with Unity Catalog.

I had a DLT pipeline in Azure Databricks running against local storage, i.e. the hive_metastore, and the process took about 2 hours with an autoscaling cluster configuration. To better manage the data, I refactored the pipeline to use Unity Catalog, replacing the mount with a Volume and setting the target to a Unity Catalog schema. I was surprised to find that the new process took more than 6 hours.
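
For reference, the shape of the change is roughly the following sketch (placeholder paths and names, not the actual pipeline code):

import dlt

# Before: source read from an ADLS mount, target in hive_metastore
# SOURCE_PATH = "/mnt/raw/events/"

# After: source read from a Unity Catalog Volume; the target catalog and
# schema are set in the pipeline settings rather than in the code.
SOURCE_PATH = "/Volumes/my_catalog/raw/landing/events/"

@dlt.table(comment="Bronze events ingested with Auto Loader")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(SOURCE_PATH)
    )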

Could you please let me know whether there is some expected overhead when using Unity Catalog with DLT? For a normal job I can find this kind of information in the Spark UI, but where can I find more detail on the DLT process to debug the performance issue?

Thank you! 

4 REPLIES

harvey-c
New Contributor III

Thank you Kaniz for the information; however, it did not address the key point of the question, i.e. how to debug the performance of DLT when the same code behaves so differently with and without Unity Catalog. From the UI I can see the outcome of the execution, i.e. that the pipeline ran for 6 hours, but no further details such as time spent waiting for connections, batch size, files processed per second, cluster resource utilisation, etc. This information can be found in the Spark UI and driver log when the notebooks are executed without DLT.
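
(One place that does expose per-flow detail beyond the UI appears to be the pipeline event log, which can be read as a Delta table; a minimal sketch, assuming the default storage location and with a placeholder pipeline id:)

# The event log for a hive_metastore pipeline is typically a Delta table under
# the pipeline's storage location; <pipeline-id> below is a placeholder.
events = spark.read.format("delta").load("dbfs:/pipelines/<pipeline-id>/system/events")

# flow_progress events carry per-flow status and metrics such as rows written.
(events
    .filter("event_type = 'flow_progress'")
    .selectExpr(
        "timestamp",
        "origin.flow_name",
        "details:flow_progress.status",
        "details:flow_progress.metrics.num_output_rows")
    .orderBy("timestamp")
    .show(truncate=False))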

Could you please point me to more detailed technical insight into DLT pipeline execution?

Thank you!

harvey-c
New Contributor III

After more investigation in the Spark UI, I found that some jobs, such as "_materialization_mat_xxxxx id start at PyysicalFlow.scala:319", can take a significant amount of time, e.g. hours, before other jobs execute. Is there any configuration that can be set to manage this behaviour?

Thanks. 

Mystagon
New Contributor II

Hey Harvey, I'm running into roughly the same performance problems as you:

From around 25 minutes in a normal workspace to 1 hour and 20 minutes in a UC workspace, which is roughly 3x slower.

Did you manage to solve this? I've also noticed dbutils.fs.ls() is much slower as well, by around 4-8x.
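
A simple way to compare the two, for anyone who wants to reproduce it (both paths below are placeholders):

import time

def time_listing(path, runs=3):
    # Time dbutils.fs.ls() on a path a few times and print the average.
    durations = []
    for _ in range(runs):
        start = time.time()
        dbutils.fs.ls(path)
        durations.append(time.time() - start)
    print(f"{path}: avg {sum(durations) / len(durations):.2f}s over {runs} runs")

time_listing("/mnt/raw/events/")                         # mounted path
time_listing("/Volumes/my_catalog/raw/landing/events/")  # UC Volume path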

harvey-c
New Contributor III

Hi, Mystagon

After some calls with the Databricks support team, it turned out there was a bug affecting source data read through a Volume. The workaround is to define the source without a Volume.
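
In other words, the source is pointed at the storage location directly instead of going through the Volume, roughly like this (placeholder names, not the actual paths; reading a cloud URI directly in a UC pipeline may need an external location and storage credential set up):

import dlt

# Placeholder names throughout; the idea is to bypass the Volume path.
# SOURCE_PATH = "/Volumes/my_catalog/raw/landing/events/"                  # Volume-based source (hits the bug)
SOURCE_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/events/"  # direct cloud storage path

@dlt.table(comment="Bronze events read directly from cloud storage")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(SOURCE_PATH)
    )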

The bug may be fixed in February 2024.

I hope this helps.

Harvey 
