Databricks Community

Mystagon · ‎01-23-2024

Hey, I need some help / suggestions troubleshooting this. I have two Databricks workspaces, Common and Lakehouse.

The differences between them are:

Major Differences:

- Lakehouse is using Unity Catalog

- Lakehouse is using External Locations, whereas credentials for Common are set using a service principal.

- Listing directories in Common is at least 4-8 times faster than in the Lakehouse environment.

- Lakehouse is in VNET and is accessed using company VPN.

Configuration:

- Both DLT pipelines are configured the same except for the catalog location; since Common isn't UC, its output is saved to a DBFS location.

- Both DLT pipelines are reading from the same storage container. Other than that nothing else is different.

DLT Comparisons:

- Common (17m 21s) + 5 minutes, because the cluster was already running.

  - 0m: Task starts (cluster already running), Initialising

  - 0m 30s: Setting up tables (7m 30s)

  - 8m: Graph initialised and tables are being populated (9m 30s)

  - 17m 30s: All tables complete

- Lakehouse (1h 17m 23s)

  - 0m: Task starts (spinning up cluster)

  - 4m 30s: Initialising (15m 20s)

  - 19m 50s: Setting up tables (47m)

  - 1h 6m 50s: Graph initialised and tables are being populated (11m)

  - 1h 17m 23s: All tables complete
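To see which phase accounts for the gap, the timestamps above can be turned into per-phase durations. This is only a small arithmetic sketch in plain Python: the start times are copied from the post, and the helper names (`to_seconds`, `phase_durations`) are made up for illustration.

```python
def to_seconds(h=0, m=0, s=0):
    """Convert an h/m/s timestamp from the post into seconds."""
    return h * 3600 + m * 60 + s

# Phase start times as reported above; the last entry is the finish time.
common = [
    ("Initialising", to_seconds()),
    ("Setting up tables", to_seconds(s=30)),
    ("Populating tables", to_seconds(m=8)),
    ("Complete", to_seconds(m=17, s=30)),
]
lakehouse = [
    ("Cluster start-up", to_seconds()),
    ("Initialising", to_seconds(m=4, s=30)),
    ("Setting up tables", to_seconds(m=19, s=50)),
    ("Populating tables", to_seconds(h=1, m=6, s=50)),
    ("Complete", to_seconds(h=1, m=17, s=23)),
]

def phase_durations(timeline):
    """Duration of each phase = start of the next phase minus its own start."""
    return {
        name: next_start - start
        for (name, start), (_, next_start) in zip(timeline, timeline[1:])
    }

common_d = phase_durations(common)
lakehouse_d = phase_durations(lakehouse)

# Compare only the phases that exist in both runs.
for phase, common_s in common_d.items():
    lake_s = lakehouse_d[phase]
    print(f"{phase}: Common {common_s}s, Lakehouse {lake_s}s, "
          f"ratio {lake_s / common_s:.1f}x")
```

With these numbers the populating phase is roughly comparable (570s vs 633s), while Initialising (30s vs 920s) and Setting up tables (450s vs 2820s) account for almost all of the difference.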

I am assuming there is a bottleneck somewhere, but I am having a hard time troubleshooting it. I think it could be one of the following: Unity Catalog overhead, or VPN performance between Databricks and the storage container.

Lakshay · ‎01-23-2024

This needs a detailed analysis to understand the root cause. A good starting point is to compare the Spark UI for both runs and identify which part of the execution is taking the time. After that, we need to look at the logs.

kerem · ‎11-18-2024

We had a similar experience where the initialisation and setting up tables steps took as long as 2 hours before the pipeline started.

It got worse when we migrated the pipeline to Unity Catalog. When we requested support, one of the first recommendations was to remove all "directory listing" operations.

The first steps of our pipeline were traversing directories and reading CSVs into dataframes. Converting those into streaming steps brought the total initialisation + setting up tables time down to <10 minutes.

However, it is a big mystery why directory listing performs so terribly in Delta Live Tables. I could never get an answer to it.
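The change described above, replacing manual directory traversal and batch CSV reads with streaming reads, could look roughly like the sketch below. It assumes Auto Loader (the `cloudFiles` streaming source) is what the streaming steps use; the function names, the `header` option, and the table name `asset_raw` are hypothetical and not taken from the original pipeline.

```python
def autoloader_options(file_format="csv"):
    """Reader options for Auto Loader. The CSV header setting is an assumption."""
    return {
        "cloudFiles.format": file_format,
        "header": "true",
    }

def read_landing_stream(spark, path):
    """Return a streaming DataFrame that discovers new files incrementally,
    instead of listing the directory tree and reading each CSV in a loop."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options())
        .load(path)
    )

# Inside a DLT pipeline notebook, where `spark` and the `dlt` module are
# available, the reader could be exposed as a streaming table, for example:
#
#   import dlt
#
#   @dlt.table(name="asset_raw")   # hypothetical table name
#   def asset_raw():
#       return read_landing_stream(spark, "<landing path>")
#
# Outside DLT, Auto Loader schema inference also needs a
# "cloudFiles.schemaLocation" option pointing at a writable path.
```

The point of the design is that file discovery is left to the streaming source, so the pipeline code itself no longer performs any explicit directory listing.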

Mystagon · ‎11-18-2024

For this particular ingestion we had no control over the directory structure, as it was coming in from another team. With the right folder structure it is pretty fast, but not as fast as in non-UC workspaces.

Also note that listing directories with dbutils.fs.ls is much faster through volumes than through external locations. However, I don't recommend using UC volumes with Auto Loader.

We opted for optimising the directory structure first and then using Auto Loader, which ended up saving a lot of time.

Old:

/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/asset.csv

/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/work.csv

New:

/WACS/asset/YYYY=2024/MM=08/DD=22/TIME=1153/asset.csv

/WACS/work/YYYY=2024/MM=08/DD=22/TIME=1153/work.csv
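The restructuring above moves each file type under its own top-level folder, so that a reader for one feed only has to look under that feed's prefix. A minimal sketch of the path mapping, using only the two example paths shown (the function name `to_per_file_layout` and the default root are illustrative assumptions):

```python
from pathlib import PurePosixPath

def to_per_file_layout(old_path, root="/WACS"):
    """Map /WACS/<partitions>/<name>.csv to /WACS/<name>/<partitions>/<name>.csv."""
    path = PurePosixPath(old_path)
    # relative_to raises ValueError if the path is not under the given root.
    relative = path.relative_to(root)
    # The file stem ("asset", "work") becomes the first folder under the root.
    return str(PurePosixPath(root) / path.stem / relative)

print(to_per_file_layout("/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/asset.csv"))
print(to_per_file_layout("/WACS/YYYY=2024/MM=08/DD=22/TIME=1153/work.csv"))
```

This only computes the new path strings; actually moving the files would still be a separate copy or rename step in the storage account.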

Also see details of my ticket below:

**Symptom**

===============

From tests I ran earlier this year, there is definitely a performance bottleneck. It is apparent when autoloading, which is around 2-4x slower when using Unity Catalog than in the Common workspace.

**Cause**

===============

The Databricks engineering team found the root cause (RCA) of the difference between the UC pipeline run time and the non-UC pipeline run time:

Difference in the update Initialization step

- Connecting the user-defined and virtualized data flow graph. This happens so that we can construct a decomposed data flow graph which doesn’t happen for HMS pipelines.
- Creation of empty delta tables in UC. This only happens if the delta table doesn’t already exist.

Difference in the update setting up tables step

- The time spent here should ideally be comparable. The main difference is that we interact with UC rather than HMS.
- We observed that materializing tables to UC is somewhat more expensive than to HMS. This requires a bit more investigation.
- Similarly, data flow graph connections are more expensive than with HMS, likely because the number of flows is double (because of reconciliation flows).

Difference in the update Running step

- Auto compaction is enabled by default for UC pipelines, so there is generally more work happening.

**Suggestion**

===============

The engineering team is working on removing this latency overhead. We checked on the timeline with the engineering team, and they mentioned that there is no ETA for this now. As a workaround, we suggest continuing with the non-UC pipeline until this is fixed.

arjun_kr · ‎11-18-2024

- Listing directories in common is at least 4-8 times faster than Lakehouse environment.

Are you able to replicate the issue using a simple dbutils list operation (dbutils.fs.ls), or by copying a sample file (say, a 100 MB file) using dbutils.fs.cp? It would be good to isolate it by performing simple operations in both environments. If you can reproduce the behavior with dbutils, then try using curl to download a file via a pre-signed URL generated from the blob storage. If the simple curl test reproduces the issue, then you may need to review the differences in network setup between the two environments.
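One way to run the listing comparison suggested above is a small timing helper that is executed unchanged in both workspaces. This is a sketch under assumptions: `time_listing` is a made-up name, and it uses `os.listdir` on the current directory so it runs anywhere. In a Databricks notebook, `dbutils.fs.ls` and the storage path under test would be passed in instead.

```python
import os
import statistics
import time

def time_listing(list_fn, path, repeats=5):
    """Call list_fn(path) several times and summarise the wall-clock time."""
    durations = []
    entries = []
    for _ in range(repeats):
        start = time.perf_counter()
        entries = list_fn(path)
        durations.append(time.perf_counter() - start)
    return {
        "entries": len(entries),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }

# Local stand-in so the sketch is runnable outside Databricks.
# In a notebook: time_listing(dbutils.fs.ls, "<path to the same container>")
result = time_listing(os.listdir, ".")
print(result)
```

Repeating the call and reporting the median helps separate a consistently slow listing from a one-off cold start, which matters when comparing a VNET workspace against a non-VNET one.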


Performance Issues with Unity Catalog
