01-23-2024 02:10 AM
01-23-2024 04:06 AM
This needs a detailed analysis to understand the root cause. A good starting point is to compare the Spark UI for both runs and identify which part of the execution is taking the time. After that, we need to look at the logs.
Monday - last edited Monday
We had a similar experience where the initialisation and setting up tables steps took as long as 2 hours before the pipeline started.
It got worse when we migrated the pipeline to Unity Catalog. When we requested support, one of the first recommendations was to remove all "directory listing" operations.
The first steps of our pipeline were traversing directories and reading CSVs into dataframes. Converting those into streaming steps brought the total time for initialisation and setting up tables down to under 10 minutes.
However, it remains a mystery why directory listing performs so terribly in Delta Live Tables. I could never get an answer to that.
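For anyone attempting the same conversion, here is a minimal sketch of a streaming CSV read with Auto Loader in place of a manual directory traversal. The function names, the `header` setting, and the optional schema location are my assumptions for illustration, not the pipeline's actual code. As far as I know, a DLT pipeline manages the schema location itself, so it is only set here when passed explicitly.

```python
from typing import Optional


def csv_autoloader_options(schema_location: Optional[str] = None) -> dict:
    """Build Auto Loader options for incrementally ingesting CSV files."""
    options = {
        "cloudFiles.format": "csv",  # tell Auto Loader the files are CSV
        "header": "true",            # assumption: the CSVs have a header row
    }
    # Outside DLT, Auto Loader needs somewhere to store the inferred schema.
    if schema_location is not None:
        options["cloudFiles.schemaLocation"] = schema_location
    return options


def read_csv_stream(spark, source_dir: str, schema_location: Optional[str] = None):
    """Return a streaming DataFrame over source_dir.

    `spark` is the active SparkSession on a Databricks cluster. Auto Loader
    discovers new files incrementally, so no explicit listing is written here.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**csv_autoloader_options(schema_location))
        .load(source_dir)
    )
```

In a DLT pipeline, the return value of `read_csv_stream(spark, "<source directory>")` would be the body of a table function, so the pipeline itself never has to list the directory.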
Monday
For this particular ingestion we had no control over the directory structure, as it was coming in from another team. With the right folder structure, ingestion is pretty fast, though still not as fast as in non-UC workspaces.
Also note that listing directories with dbutils.fs.ls is much faster through volumes than through external locations. However, I don't recommend using UC volumes with Auto Loader.
We opted for optimising the directory structure first and then using Auto Loader, which ended up saving a lot of time.
Old:
/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/asset.csv
/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/work.csv
New:
/WACS/asset/YYYY=2024/MM=08/DD=22/TIME=1153/asset.csv
/WACS/work/YYYY=2024/MM=08/DD=22/TIME=1153/work.csv
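The relayout above can be sketched as a small path-mapping helper. It is only an illustration: the function name is made up, and it assumes absolute paths with a single root folder and a file name whose stem is the feed name (`asset`, `work`).

```python
from pathlib import PurePosixPath


def to_per_feed_path(old_path: str) -> str:
    """Map /ROOT/<partitions>/<feed>.csv to /ROOT/<feed>/<partitions>/<feed>.csv."""
    parts = PurePosixPath(old_path).parts
    # e.g. ('/', 'WACS', 'YYYY=2024', 'MM=08', 'DD=22', 'TIME=1153', 'asset.csv')
    root, partitions, filename = parts[:2], parts[2:-1], parts[-1]
    feed = PurePosixPath(filename).stem  # 'asset.csv' -> 'asset'
    return str(PurePosixPath(*root, feed, *partitions, filename))


print(to_per_feed_path("/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/asset.csv"))
```

With each feed under its own top-level folder, an Auto Loader stream can point at `/WACS/asset` or `/WACS/work` directly and no longer has to scan the other feed's files.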
Also see details of my ticket below:
**Symptom**
===============
From tests I ran earlier this year, there is definitely a performance bottleneck when autoloading: it is around 2-4x slower with Unity Catalog than with the Common Workspace.
**Cause**
===============
The Databricks engineering team found the root cause (RCA) of the difference between the UC and non-UC pipeline run times:
Difference in the update Initialization step
- Connecting the user-defined and virtualized data flow graph. This happens so that we can construct a decomposed data flow graph, which doesn’t happen for HMS pipelines.
- Creation of empty delta tables in UC. This only happens if the delta table doesn’t already exist.
Difference in the update setting up tables step
- The time spent here should ideally be comparable. The main difference is that we interact with UC rather than HMS.
- We observed that materializing tables to UC is somewhat more expensive than to HMS. This requires a bit more investigation.
- Similarly, data flow graph connections are more expensive than with HMS, likely because the number of flows is doubled (because of reconciliation flows).
Difference in the update Running step
- Auto compaction is enabled by default for UC pipelines, so there is generally more work happening.
**Suggestion**
===============
The engineering team is working on removing this latency overhead. We checked the timeline with them, and there is currently no ETA. As a workaround, we suggest continuing with the non-UC pipeline until this is fixed.