04-02-2025 03:31 AM
Thanks for the details, @-werners-. I will check it out and optimize.
However, the clone buckets I created contain only a few fewer files than the buckets created by Delta Live Tables and managed by Unity Catalog.
In addition, looking at the execution details, when Spark reads the buckets created by DLT, the physical plan includes additional filters:

```
DataFilters: [isnull(__DeleteVersion#592), (isnull(__MEETS_DROP_EXPECTATIONS#595) OR __MEETS_DROP_EXPECTATIONS...,
Format: Parquet,
Location: PreparedDeltaFileIndex(1 paths)[gs://cimb-prod-lakehouse/gold-layer/__unitystorage/schemas/8962e5...,
PartitionFilters: [],
PushedFilters: [IsNull(__DeleteVersion), Or(IsNull(__MEETS_DROP_EXPECTATIONS),EqualTo(__MEETS_DROP_EXPECTATIONS
```
The clone buckets I created with the same configuration, however, do not have these filters:

```
DataFilters: []
```
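To make the extra work concrete, here is a plain-Python sketch of the row-level predicate implied by the `PushedFilters` shown for the DLT table. The column names come from the plan output; the value in the truncated `EqualTo(__MEETS_DROP_EXPECTATIONS` clause is cut off in the output, so treating it as a comparison against `True` is an assumption (that is the typical shape for a drop-expectation flag), not something the plan confirms:

```python
def keep_row(row: dict) -> bool:
    """Approximate the DLT read predicate:
    IsNull(__DeleteVersion) AND
    (IsNull(__MEETS_DROP_EXPECTATIONS) OR EqualTo(__MEETS_DROP_EXPECTATIONS, <assumed True>))
    """
    # IsNull(__DeleteVersion): rows with a delete version set are filtered out.
    if row.get("__DeleteVersion") is not None:
        return False
    # Or(IsNull(...), EqualTo(..., True)) -- the True here is an assumption,
    # since the comparison value is truncated in the plan output.
    meets = row.get("__MEETS_DROP_EXPECTATIONS")
    return meets is None or meets is True

rows = [
    {"id": 1},                                     # no metadata columns -> kept
    {"id": 2, "__DeleteVersion": 7},               # soft-deleted -> dropped
    {"id": 3, "__MEETS_DROP_EXPECTATIONS": False}, # failed drop expectation -> dropped
    {"id": 4, "__MEETS_DROP_EXPECTATIONS": True},  # passed -> kept
]
print([r["id"] for r in rows if keep_row(r)])  # -> [1, 4]
```

If these metadata columns exist in every data file, Spark has to evaluate this predicate on every row of the DLT table, work the cloned table skips entirely with `DataFilters: []`.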
I suspect this is why Spark takes so long to read the buckets created by DLT, but I have never encountered this behavior before.
Any thoughts on it?
Hung Nguyen