<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance issue with Spark SQL when working with data from Unity Catalog in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114267#M44768</link>
    <description>&lt;P&gt;Yes, you are right: for the file storage buckets created by DLT, the number of files is a bit larger than in the temporary buckets I created. But the total number of files in the data storage buckets managed by Unity Catalog is only about 100.&lt;BR /&gt;I don't understand why, with so few files, reading the data with Spark is still very slow.&lt;/P&gt;</description>
    <pubDate>Wed, 02 Apr 2025 09:26:42 GMT</pubDate>
    <dc:creator>minhhung0507</dc:creator>
    <dc:date>2025-04-02T09:26:42Z</dc:date>
    <item>
      <title>Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114186#M44740</link>
      <description>&lt;P&gt;We're encountering a performance issue with Spark SQL when working with data from Unity Catalog. Specifically, when I use Spark to read data from a Unity Catalog partition created by DLT and then create a view, retrieval on the executors is very slow. However, if I clone the data outside of the Unity Catalog partition (still managed by DLT), the performance improves dramatically.&lt;/P&gt;&lt;P&gt;Has anyone seen similar behavior, or can anyone shed some light on what might be causing the discrepancy? Could it be related to metadata overhead, permission handling, or something else specific to how Unity Catalog manages partitions? Any insights or troubleshooting tips would be greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Apr 2025 14:30:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114186#M44740</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-01T14:30:33Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114247#M44759</link>
      <description>&lt;P&gt;I have not yet noticed a slowdown due to Unity Catalog itself.&lt;BR /&gt;I did, however, see terrible performance using a shared-mode cluster.&lt;BR /&gt;What you can check is how the data is physically stored: perhaps this partition is skewed or written as many small files.&lt;BR /&gt;Running OPTIMIZE on the Delta table could also help.&lt;BR /&gt;I never use DLT because of the lack of control and visibility, but that is of course my own opinion.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 07:50:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114247#M44759</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-02T07:50:29Z</dc:date>
    </item>
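    <!-- The physical-layout check and OPTIMIZE advice above can be sketched in Spark SQL; the table name is a hypothetical placeholder:

    ```sql
    -- Inspect the physical layout first: numFiles and sizeInBytes together
    -- reveal a small-files problem (many files with a tiny average size).
    DESCRIBE DETAIL my_catalog.my_schema.my_table;  -- hypothetical table name

    -- Compact small files into larger ones (Delta Lake):
    OPTIMIZE my_catalog.my_schema.my_table;

    -- Re-run the detail check to verify numFiles dropped:
    DESCRIBE DETAIL my_catalog.my_schema.my_table;
    ```

    OPTIMIZE rewrites small files into larger ones without changing table contents, so it is safe to run on a live table. -->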
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114255#M44761</link>
      <description>&lt;P&gt;Is your dataset large, with many partitions?&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Use the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;DESCRIBE DETAIL&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;command to inspect the metadata of the Unity Catalog table and the cloned table. Compare the number of partitions and other metadata details.&lt;/P&gt;&lt;PRE&gt;DESCRIBE DETAIL &amp;lt;unity_catalog_table&amp;gt;;&lt;BR /&gt;DESCRIBE DETAIL &amp;lt;cloned_table&amp;gt;;&lt;/PRE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;If the Unity Catalog table has significantly more partitions or more complex metadata, this could explain the performance difference.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;and&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Compare the query plans for the Unity Catalog table and the cloned table using the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;EXPLAIN&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;command:&lt;/P&gt;&lt;PRE&gt;EXPLAIN EXTENDED SELECT * FROM &amp;lt;unity_catalog_table&amp;gt;;&lt;BR /&gt;EXPLAIN EXTENDED SELECT * FROM &amp;lt;cloned_table&amp;gt;;&lt;/PRE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Look for differences in the query plans, such as additional metadata operations or a lack of partition pruning.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 02 Apr 2025 08:15:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114255#M44761</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-04-02T08:15:45Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114262#M44765</link>
      <description>&lt;P&gt;Let me clarify a bit: the tables are created and stored in a Google Cloud Storage bucket. I use Spark to read data directly from those partitions, but reads are still very slow compared to a clone in another partition not managed by Unity Catalog.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 09:08:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114262#M44765</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-02T09:08:42Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114264#M44767</link>
      <description>&lt;P&gt;That is possible; that is why I mentioned looking at the physical files.&lt;BR /&gt;If the original partition consists of 200 small files and the clone of 1 or 4 bigger files, that is a huge difference.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 09:15:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114264#M44767</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-02T09:15:26Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114267#M44768</link>
      <description>&lt;P&gt;Yes, you are right: for the file storage buckets created by DLT, the number of files is a bit larger than in the temporary buckets I created. But the total number of files in the data storage buckets managed by Unity Catalog is only about 100.&lt;BR /&gt;I don't understand why, with so few files, reading the data with Spark is still very slow.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 09:26:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114267#M44768</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-02T09:26:42Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114279#M44771</link>
      <description>&lt;P&gt;Here you can find some info.&lt;BR /&gt;Basically it boils down to: reading a file = overhead.&lt;BR /&gt;You want to minimize overhead without stressing the workers too much (gigantic partitions).&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/globant/how-to-solve-a-large-number-of-small-files-problem-in-spark-21f819eb36d3" target="_blank"&gt;https://medium.com/globant/how-to-solve-a-large-number-of-small-files-problem-in-spark-21f819eb36d3&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 10:17:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114279#M44771</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-02T10:17:11Z</dc:date>
    </item>
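    <!-- One way to reduce per-file overhead, sketched here with a hypothetical table name, is to give Delta a larger target file size before compacting:

    ```sql
    -- Assumed table name. delta.targetFileSize steers the file size Delta
    -- aims for during OPTIMIZE and optimized writes; 128 MB shown here.
    ALTER TABLE my_catalog.my_schema.my_table
      SET TBLPROPERTIES ('delta.targetFileSize' = '134217728');

    -- Rewrite the small files at the new target size:
    OPTIMIZE my_catalog.my_schema.my_table;
    ```

    Larger files mean fewer open/read round-trips per scan, which matters most on object storage such as GCS. -->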
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114281#M44772</link>
      <description>&lt;P&gt;Thanks for the details&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;. I will check it out and optimize.&lt;BR /&gt;But actually, the number of files in the clone buckets I created is only a few less than in the buckets created by Delta Live Tables and managed by Unity Catalog.&lt;BR /&gt;In addition, when I look at the details, the scan Spark runs over the buckets created by DLT carries extra filters:&lt;BR /&gt;DataFilters: [isnull(__DeleteVersion#592), (isnull(__MEETS_DROP_EXPECTATIONS#595) OR __MEETS_DROP_EXPECTATIONS..., Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[gs://cimb-prod-lakehouse/gold-layer/__unitystorage/schemas/8962e5..., PartitionFilters: [], PushedFilters: [IsNull(__DeleteVersion), Or(IsNull(__MEETS_DROP_EXPECTATIONS),EqualTo(__MEETS_DROP_EXPECTATIONS&lt;/P&gt;&lt;P&gt;The clone buckets I created with the same config do not have these filters:&lt;BR /&gt;DataFilters: []&lt;/P&gt;&lt;P&gt;I think this might be why Spark takes so long on the buckets created by DLT, but I've never encountered this before.&lt;BR /&gt;Any thoughts on it?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 10:31:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114281#M44772</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-02T10:31:53Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114282#M44773</link>
      <description>&lt;P&gt;If there is only a small difference in the number of files, that is probably not it (unless there is serious data skew).&lt;BR /&gt;DLT probably adds these filters for data-quality reasons (expectations that might be defined on the DLT pipeline). But as I do not use DLT, I could be wrong here.&lt;BR /&gt;It certainly could be the cause, though, because the filters will be evaluated on read.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 10:40:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114282#M44773</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-02T10:40:57Z</dc:date>
    </item>
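    <!-- Whether those expectation filters show up in the scan can be checked from the query plan; a sketch with placeholder table names:

    ```sql
    -- EXPLAIN FORMATTED prints the FileScan node of the physical plan;
    -- compare the DataFilters / PushedFilters entries between the
    -- DLT-managed table and the clone.
    EXPLAIN FORMATTED SELECT * FROM my_catalog.gold.dlt_table;   -- hypothetical
    EXPLAIN FORMATTED SELECT * FROM my_catalog.gold.clone_table; -- hypothetical
    ```

    If only the DLT table's plan carries IsNull(__DeleteVersion) and __MEETS_DROP_EXPECTATIONS predicates, those filters are being evaluated on every read of that table. -->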
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114374#M44793</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;,&lt;BR /&gt;If the problem really is caused by these filters that DLT applies for data quality, is there a way to ignore these &lt;STRONG&gt;filters&lt;/STRONG&gt; when reading the data, and to ignore the &lt;STRONG&gt;delta_log&lt;/STRONG&gt; files when reading that partition?&lt;BR /&gt;I want to verify that the filters DLT adds automatically are what causes the performance problems when processing the data with Spark.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2025 07:53:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114374#M44793</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-03T07:53:07Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114376#M44794</link>
      <description>&lt;P&gt;You can read the physical Parquet files with spark.read.parquet().&lt;/P&gt;&lt;P&gt;The trick is to know which files are the current ones.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2025 08:47:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114376#M44794</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-03T08:47:35Z</dc:date>
    </item>
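    <!-- Reading the Parquet files directly, bypassing the Delta log and therefore the DLT filters, can be sketched in Spark SQL; the path is a placeholder:

    ```sql
    -- Query the raw Parquet files under a (hypothetical) table location.
    -- This skips the _delta_log entirely, so files the log has already
    -- marked as removed may still be read - use only to verify overhead,
    -- not for correct results.
    SELECT * FROM parquet.`gs://my-bucket/path/to/table/`;
    ```

    If this direct read is fast while the Delta/Unity Catalog read is slow, that points at the expectation filters (or log replay) rather than the file layout. -->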
  </channel>
</rss>

