12-03-2023 09:39 PM
Dear Community Members,
This question is about debugging a performance issue in a DLT pipeline with Unity Catalog.
I had a DLT pipeline in Azure Databricks running on the local store, i.e. hive_metastore, and the process took about 2 hours with an autoscaling cluster configuration. To better manage the data, I refactored the process to use Unity Catalog, replacing the mount with a volume and targeting a Unity Catalog schema. I was surprised to find the new process took more than 6 hours.
Could you please let me know if there is expected overhead when using Unity Catalog with DLT? For a normal job I can find more information in the Spark UI, but where can I find details on the DLT process to debug the performance issue?
Thank you!
12-04-2023 01:43 AM
Hi @harvey-c! Certainly, let's explore how you can monitor and log row counts in your Delta Live Tables (DLT) pipelines.
Monitoring DLT Pipelines:
DLT provides built-in features for monitoring and observability. You can review most monitoring data manually through the pipeline details UI.
Here are some key aspects to consider:
Row Count Validation:
You can add an additional table to your pipeline that defines an expectation to compare row counts between two live tables.
The results of this expectation appear in the event log and the DLT UI.
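For illustration, here is a minimal sketch of that pattern; the table names (orders_raw, orders_cleaned) and the expectation are hypothetical and should be adapted to your pipeline:
```python
import dlt

# Sketch: a validation table that compares row counts between two live
# tables. The expectation result appears in the event log and the DLT UI.
@dlt.table(name="row_count_check")
@dlt.expect_or_fail("counts_match", "raw_count == cleaned_count")
def row_count_check():
    return spark.sql("""
        SELECT
          (SELECT COUNT(*) FROM LIVE.orders_raw) AS raw_count,
          (SELECT COUNT(*) FROM LIVE.orders_cleaned) AS cleaned_count
    """)
```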
Custom Logging:
If you need more customized logging, you can create your own logging statements within your DLT pipeline code.
For example, in Python, you can use the logging module to log messages. Here's a minimal sketch (the logger name and table names below are placeholders):
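```python
import logging

import dlt

# Placeholder logger name; adjust for your pipeline.
logger = logging.getLogger("my_dlt_pipeline")

@dlt.table(name="orders_enriched")
def orders_enriched():
    # "orders_raw" is a hypothetical upstream live table.
    df = dlt.read("orders_raw")
    # Note: count() triggers an extra pass over the data, so use sparingly.
    logger.info("orders_raw row count: %d", df.count())
    return df
```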
Adjust the logger name and log messages as needed for your specific use case.
Remember that the Delta Live Tables event log contains all information related to your pipeline, including audit logs, data quality checks, pipeline progress, and data lineage. You can use this event log to track, understand, and monitor the state of your data pipelines. Happy debugging!
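If you want to dig into the event log programmatically rather than through the UI, here is a sketch. It assumes a Unity Catalog-enabled pipeline and uses a placeholder pipeline ID; for hive_metastore pipelines, the log is a Delta table under the pipeline's storage location:
```python
# Sketch: querying the DLT event log with the event_log() table-valued
# function (UC pipelines). Replace the pipeline ID with your own.
events = spark.sql("""
    SELECT timestamp, event_type, message
    FROM event_log('<your-pipeline-id>')
    ORDER BY timestamp DESC
""")
events.show(truncate=False)

# For hive_metastore pipelines, the event log is a Delta table under the
# pipeline's storage location (the default path below is an assumption):
# events = spark.read.format("delta").load(
#     "dbfs:/pipelines/<pipeline-id>/system/events")
```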
12-04-2023 01:10 PM
Thank you Kaniz for the information; however, it did not address the key point of the question, i.e. how to debug the performance issue of DLT, given the same code, with different results with and without Unity Catalog. From the UI I can find the outcome of the execution, i.e. the pipeline ran 6 hours, but no further details such as waiting for connections, batch sizes, files processed per second, cluster resource utilisation, etc. This information can be found in the Spark UI and driver log when the notebooks are executed without DLT.
Could you please point me to more detailed technical insight into DLT pipeline execution?
Thank you!
12-06-2023 10:26 PM
After more investigation in the Spark UI, I found that some jobs like "_materialization_mat_xxxxx id start at PhysicalFlow.scala:319" may take significant time, e.g. hours, before other jobs execute. Are there any configurations that can be set to manage this behaviour?
Thanks.
01-23-2024 02:20 AM
Hey Harvey, I'm running into much the same performance problems as you:
From around 25 minutes in a normal workspace to 1 hour and 20 minutes in a UC workspace, which is roughly 3x slower.
Did you manage to solve this? I've also noticed dbutils.fs.ls() is around 4-8x slower as well.
01-23-2024 09:04 PM
Hi Mystagon,
After some calls with the Databricks support team, it was found that there was a bug affecting source data accessed via a Volume. The workaround is to create the source without the volume.
The bug may be fixed in Feb 2024.
I hope this helps.
Harvey