12-03-2023 09:39 PM
Dear Community Members,
This question is about debugging a performance issue in a DLT pipeline that uses Unity Catalog.
I had a DLT pipeline in Azure Databricks running against the local store, i.e. hive_metastore, and the process took about two hours with an autoscaling cluster configuration. To better manage the data, I refactored the pipeline to use Unity Catalog by replacing the mount with a Volume and pointing the target at a Unity Catalog schema. I was surprised to find that the new process took more than six hours.
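To illustrate, the change is roughly the following; the paths, catalog, schema, and table names are placeholders rather than the actual pipeline code:

```python
import dlt

# Before (hive_metastore pipeline): source read from a DBFS mount.
# After (Unity Catalog pipeline): source read from a UC Volume, with the
# pipeline's target set to a Unity Catalog schema in the pipeline settings.
# Both tables are shown side by side only for illustration.

@dlt.table(name="raw_events_hms")
def raw_events_hms():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events/")  # mount-based source path
    )

@dlt.table(name="raw_events_uc")
def raw_events_uc():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/my_catalog/my_schema/landing/events/")  # Volume source path
    )
```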
Could you please let me know whether some overhead is expected when using Unity Catalog with DLT? For a normal job I can find plenty of information in the Spark UI, but where can I find more detail on the DLT process to debug this performance issue?
Thank you!
12-04-2023 01:10 PM
Thank you Kaniz for the information; however, it did not address the key point of the question, i.e. how to debug the performance of a DLT pipeline when the same code behaves differently with and without Unity Catalog. From the UI I can see the outcome of the execution, i.e. that the pipeline ran for 6 hours, but no further details such as time spent waiting for connections, batch size, files processed per second, cluster resource utilisation, etc. This information can be found in the Spark UI and driver logs when the notebooks are executed without DLT.
Could you please point me to more detailed technical insight into DLT pipeline execution?
Thank you!
12-06-2023 10:26 PM
After more investigation in the Spark UI, I found that some jobs such as "_materialization_mat_xxxxx id start at PhysicalFlow.scala:319" may take a significant amount of time, e.g. hours, before the other jobs execute. Is there any configuration that can be set to manage this behaviour?
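In case it helps anyone reproduce this, the pipeline's event log can also be queried for per-flow timings. Below is a rough sketch, assuming the event_log() table-valued function is available in the workspace; the pipeline ID is a placeholder and the exact event types and fields may differ:

```python
# Query the DLT event log for progress and resource events (run from a notebook).
# The pipeline ID is a placeholder. For hive_metastore pipelines the event log
# can instead be read as a Delta table from <storage-location>/system/events.
events = spark.sql("""
    SELECT timestamp, event_type, message
    FROM event_log('xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx')
    WHERE event_type IN ('flow_progress', 'cluster_resources')
    ORDER BY timestamp
""")
events.show(truncate=False)
```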
Thanks.
01-23-2024 02:20 AM
Hey Harvey, I'm running into roughly the same performance problems as you:
from around 25 minutes in a normal workspace to 1 hour and 20 minutes in a UC workspace, which is roughly 3x slower.
Did you manage to solve this? I've also noticed that dbutils.fs.ls() is much slower, by around 4-8x.
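For reference, this is the kind of quick check I ran; both paths are placeholders for equivalent directories in our mount and Volume, and dbutils is only available inside a Databricks notebook:

```python
import time

# Rough wall-clock comparison of directory listings over a mount vs. a Volume.
# Both paths are placeholders; point them at equivalent directories.
MOUNT_PATH = "/mnt/landing/events/"
VOLUME_PATH = "/Volumes/my_catalog/my_schema/landing/events/"

def time_ls(path, runs=3):
    """Average time of dbutils.fs.ls(path) over a few runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        dbutils.fs.ls(path)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

print(f"mount : {time_ls(MOUNT_PATH):.2f}s")
print(f"volume: {time_ls(VOLUME_PATH):.2f}s")
```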
01-23-2024 09:04 PM
Hi Mystagon,
After some calls with the Databricks support team, it turned out there was a bug affecting source data accessed through a Volume. The workaround is to define the source without the Volume.
The bug may be fixed in February 2024.
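Roughly, the workaround looks like the following; the storage account, container, and table name are placeholders, and reading the cloud path directly assumes the necessary external location or credentials are already configured:

```python
import dlt

# Workaround sketch: read the source directly from cloud storage rather than
# through a Unity Catalog Volume. All paths and names below are placeholders.
@dlt.table(name="raw_events")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # previously: .load("/Volumes/my_catalog/my_schema/landing/events/")
        .load("abfss://landing@mystorageaccount.dfs.core.windows.net/events/")
    )
```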
I hope this helps.
Harvey