For this particular ingestion we had no control over the directory structure, as the data was coming in from another team. With the right folder structure Autoloader is pretty fast, but still not as fast as in non-UC workspaces.
Also note that listing directories with dbutils.fs.ls is much faster through volumes than through external locations. However, I don't recommend using UC volumes with Autoloader.
We opted to optimise the directory structure first and then use Autoloader, which ended up saving a lot of time.
Old:
/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/asset.csv
/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/work.csv
New:
/WACS/asset/YYYY=2024/MM=08/DD=22/TIME=1153/asset.csv
/WACS/work/YYYY=2024/MM=08/DD=22/TIME=1153/work.csv
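To illustrate the change, here is a minimal sketch of the path mapping in plain Python. It is not the code we actually ran: the function name `restructure_path` and the default root `/WACS` are assumptions for illustration, and it only computes the new path string without moving any files. It assumes the dataset name is the file name without its extension (asset.csv goes under asset/).

```python
from pathlib import PurePosixPath


def restructure_path(old_path: str, root: str = "/WACS") -> str:
    """Map an old-layout path to the new per-dataset layout.

    Hypothetical helper: the dataset folder is taken from the file name
    without its extension, and is inserted directly under the root.
    """
    path = PurePosixPath(old_path)
    root_path = PurePosixPath(root)

    # Part of the path below the root, e.g. YYYY=2024/MM=08/.../asset.csv.
    # Raises ValueError if old_path is not under root.
    relative = path.relative_to(root_path)

    # Dataset name derived from the file name, e.g. "asset" for asset.csv.
    dataset = path.stem

    return str(root_path / dataset / relative)


if __name__ == "__main__":
    old = "/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/work.csv"
    print(restructure_path(old))
```

With each file type under its own top-level folder, each Autoloader stream can be pointed at a single folder (for example /WACS/asset) rather than having to list the partitions shared by every file type. On Databricks the actual move could be done with something like dbutils.fs.mv, but treat that as a suggestion rather than what we did.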
Also see details of my ticket below:
**Symptom**
===============
From tests I ran earlier this year, there is definitely a performance bottleneck: autoloading is around 2-4x slower when using Unity Catalog than when using the Common Workspace.
**Cause**
===============
The Databricks engineering team identified the root cause (RCA) of the difference between the UC and non-UC pipeline run times:
Difference in the update Initialization step
- Connecting the user-defined and virtualized data flow graph. This happens so that we can construct a decomposed data flow graph which doesn’t happen for HMS pipelines.
- Creation of empty delta tables in UC. This only happens if the delta table doesn’t already exist.
Difference in the update setting up tables step
- The time spent here should ideally be comparable. The main difference is that we interact with UC rather than HMS.
- We observed that materializing tables to UC is somewhat more expensive than to HMS. This requires a bit more investigation.
- Similarly, data flow graph connections are more expensive than with HMS, likely because the number of flows is double (because of reconciliation flows).
Difference in the update Running step
- Auto compaction is enabled by default for UC pipelines, so there is generally more work happening.
**Suggestion**
===============
The engineering team is working on removing this latency overhead. We checked the timeline with them, and there is currently no ETA. As a workaround, we suggest continuing with the non-UC pipeline until this is fixed.