Performance issue with Spark SQL when working with data from Unity Catalog
Tuesday
We're encountering a performance issue with Spark SQL when working with data from Unity Catalog. Specifically, when I use Spark to read data from a Unity Catalog partition created by DLT and then create a view, retrieval by the executors is very slow. However, if I clone the data outside of the Unity Catalog partition (still managed by DLT), performance improves dramatically.
Has anyone seen similar behavior, or can anyone shed some light on what might be causing the discrepancy? Could it be related to metadata overhead, permission handling, or something else specific to how Unity Catalog manages partitions? Any insights or troubleshooting tips would be greatly appreciated.
Hung Nguyen
- Labels: Delta Lake, Spark, Workflows
Wednesday
I have not yet noticed a slowdown due to Unity Catalog itself.
I did, however, see terrible performance using a shared access mode cluster.
What you can check is how the data is physically stored: perhaps this partition is skewed or written as many small files.
Running OPTIMIZE on the Delta table could also help.
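For example, something like this (a minimal sketch; the catalog, schema, table, and column names are placeholders):

# Compact small files in the Delta table (placeholder table name).
spark.sql("OPTIMIZE <catalog>.<schema>.<table>")
# Optionally co-locate rows on a commonly filtered column (placeholder column name):
# spark.sql("OPTIMIZE <catalog>.<schema>.<table> ZORDER BY (<column>)")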
I never use DLT because of the lack of control and visibility, but that is of course my own opinion.
Wednesday
Is your dataset large, with many partitions?
Use the DESCRIBE DETAIL command to inspect the metadata of the Unity Catalog table and the cloned table, and compare the number of partitions and other metadata details:

DESCRIBE DETAIL <unity_catalog_table>;
DESCRIBE DETAIL <cloned_table>;

If the Unity Catalog table has significantly more partitions or more complex metadata, this could explain the performance difference.
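If you want to do the same comparison from PySpark, a rough sketch (table names are placeholders):

# Compare the file layout and partitioning of both tables using DESCRIBE DETAIL output.
for t in ["<unity_catalog_table>", "<cloned_table>"]:
    detail = spark.sql(f"DESCRIBE DETAIL {t}").select("numFiles", "sizeInBytes", "partitionColumns").first()
    print(t, detail)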
Also compare the query plans for the Unity Catalog table and the cloned table using the EXPLAIN command:

EXPLAIN EXTENDED SELECT * FROM <unity_catalog_table>;
EXPLAIN EXTENDED SELECT * FROM <cloned_table>;

Look for differences in the query plans, such as additional metadata operations or missing partition pruning.
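Or from PySpark (table names are placeholders):

# Print the formatted physical plan of each table and compare them side by side.
spark.table("<unity_catalog_table>").explain(mode="formatted")
spark.table("<cloned_table>").explain(mode="formatted")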
Wednesday
Let me clarify a bit: the tables are created and stored in a Google Cloud Storage bucket. I use Spark to read data directly from those partitions, but reads are still very slow compared to cloning the data to another partition not managed by Unity Catalog.
Hung Nguyen
Wednesday
That is possible; that is why I mentioned looking at the physical files.
If the original partition consists of 200 small files and the clone of 1 or 4 bigger files, that is a huge difference.
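A quick way to check that in a Databricks notebook (the GCS path is a placeholder) would be something like:

# List the Parquet files behind one partition and compare file count and total size.
files = [f for f in dbutils.fs.ls("gs://<bucket>/<table_path>/<partition>/") if f.name.endswith(".parquet")]
print(len(files), "files,", round(sum(f.size for f in files) / 1024 / 1024, 1), "MB total")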
Wednesday
Yes, you are right: the number of files in the storage buckets created by DLT is a bit larger than in the temporary buckets I created. But the total number of files in the data storage buckets managed by Unity Catalog is only about ~100.
I don't know why, with such a small number of files, reading the data with Spark is so slow.
Hung Nguyen
Wednesday
Here you can find some info.
Basically it boils down to: reading a file = overhead.
You want to minimize that overhead without stressing the workers too much (gigantic partitions).
https://medium.com/globant/how-to-solve-a-large-number-of-small-files-problem-in-spark-21f819eb36d3
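As a rough illustration of the idea (paths and the target file count are placeholders; on Delta tables, OPTIMIZE is normally the better way to compact):

# Rewrite a directory of many small Parquet files into a handful of larger ones.
df = spark.read.parquet("gs://<bucket>/<small_files_path>/")
df.coalesce(4).write.mode("overwrite").parquet("gs://<bucket>/<compacted_path>/")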
Wednesday
Thanks for the details @-werners- . I will check it out and optimize.
But actually, the number of files in the clone buckets I created is only a few less than in the buckets created by Delta Live Tables and managed by Unity Catalog.
In addition, when I look at the details, I see that when Spark reads the buckets created by DLT there is an additional set of data filters:
DataFilters: [isnull(__DeleteVersion#592), (isnull(__MEETS_DROP_EXPECTATIONS#595) OR __MEETS_DROP_EXPECTATIONS..., Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[gs://cimb-prod-lakehouse/gold-layer/__unitystorage/schemas/8962e5..., PartitionFilters: [], PushedFilters: [IsNull(__DeleteVersion), Or(IsNull(__MEETS_DROP_EXPECTATIONS),EqualTo(__MEETS_DROP_EXPECTATIONS
while the clone buckets I created, with the same configuration as the table above, do not have these filters:
DataFilters: []
I think this might be the reason why Spark takes so long on the buckets created by DLT, but I've never encountered this before.
Any thoughts on it?
Hung Nguyen
Wednesday
If there is only a small difference in the number of files, that is probably not it (unless there is serious data skew).
DLT probably adds these filters for data quality reasons (expectations that might be defined on the DLT pipeline). But as I do not use DLT, I could be wrong here.
But it certainly could be the cause, because the filters will be evaluated on read.
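One way to verify that (a rough sketch; table names are placeholders) is to time a full scan of both tables, since count() forces the data filters to be evaluated:

import time

# Time a full scan of the DLT-managed table and the clone to quantify the read-side difference.
for t in ["<dlt_table>", "<cloned_table>"]:
    start = time.time()
    n = spark.table(t).count()
    print(t, ":", n, "rows in", round(time.time() - start, 1), "s")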
Thursday
Hi @-werners- ,
In case the problem is actually caused by these filters that DLT applies for data quality, is there a way to ignore them when reading the data, and to ignore the _delta_log files when reading that partition?
I want to verify whether the filters DLT adds automatically are what is causing the performance problems when processing the data with Spark.
Hung Nguyen
Thursday
You can read the physical Parquet files with spark.read.parquet().
The trick is to know which files are the current ones.
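A rough sketch of that idea (the table name is a placeholder; inputFiles() is best-effort, and reading the raw files bypasses the Delta log, so use this only for benchmarking):

# Ask Spark which files back the current table version, then read them as plain Parquet,
# skipping the DLT/Delta data filters.
current_files = spark.table("<catalog>.<schema>.<dlt_table>").inputFiles()
df_raw = spark.read.parquet(*current_files)
print(df_raw.count())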

