04-02-2025 03:31 AM
Thanks for the details, @-werners-. I will check it out and optimize.
However, the clone buckets I created contain only a few fewer files than the buckets created by Delta Live Tables and managed by Unity Catalog.
In addition, looking at the execution details, when Spark reads the buckets created by DLT, the physical plan includes additional filters:

```
DataFilters: [isnull(__DeleteVersion#592), (isnull(__MEETS_DROP_EXPECTATIONS#595) OR __MEETS_DROP_EXPECTATIONS...,
Format: Parquet,
Location: PreparedDeltaFileIndex(1 paths)[gs://cimb-prod-lakehouse/gold-layer/__unitystorage/schemas/8962e5...,
PartitionFilters: [],
PushedFilters: [IsNull(__DeleteVersion), Or(IsNull(__MEETS_DROP_EXPECTATIONS),EqualTo(__MEETS_DROP_EXPECTATIONS
```
The clone buckets I created with the same configuration, however, do not have these filters:

```
DataFilters: []
```
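To make the extra work concrete, here is a plain-Python sketch of the row-level predicate implied by the `PushedFilters` shown for the DLT table. The column names come from the plan output; the value in the truncated `EqualTo(__MEETS_DROP_EXPECTATIONS` clause is cut off in the output, so treating it as a comparison against `True` is an assumption (that is the typical shape for a drop-expectation flag), not something the plan confirms:

```python
def keep_row(row: dict) -> bool:
    """Approximate the DLT read predicate:
    IsNull(__DeleteVersion) AND
    (IsNull(__MEETS_DROP_EXPECTATIONS) OR EqualTo(__MEETS_DROP_EXPECTATIONS, <assumed True>))
    """
    # IsNull(__DeleteVersion): rows with a delete version set are filtered out.
    if row.get("__DeleteVersion") is not None:
        return False
    # Or(IsNull(...), EqualTo(..., True)) -- the True here is an assumption,
    # since the comparison value is truncated in the plan output.
    meets = row.get("__MEETS_DROP_EXPECTATIONS")
    return meets is None or meets is True

rows = [
    {"id": 1},                                     # no metadata columns -> kept
    {"id": 2, "__DeleteVersion": 7},               # soft-deleted -> dropped
    {"id": 3, "__MEETS_DROP_EXPECTATIONS": False}, # failed drop expectation -> dropped
    {"id": 4, "__MEETS_DROP_EXPECTATIONS": True},  # passed -> kept
]
print([r["id"] for r in rows if keep_row(r)])  # -> [1, 4]
```

If these metadata columns exist in every data file, Spark has to evaluate this predicate on every row of the DLT table, work the cloned table skips entirely with `DataFilters: []`.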
I suspect this is why Spark takes so long to read the buckets created by DLT, but I have never encountered this behavior before.
Any thoughts on it?
Hung Nguyen