Selective overwrite on Partition and Liquid clustered tables
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Friday - last edited Friday
Hi,
I have created 2 identical tables but one is partitioned and the one is a Liquid Clustered with Auto Clustering.
I inserted 30M rows x 2 (60M) for two dates , date 1 = 2026-06-01 and date = 2026-06-02 , then I overwrite the date 2026-06-02 with a selective overwrite statement
For partition table , history shows ,
| operationParameters | |
object
| |
| operationMetrics | |
object
|
For Liquid Clustered table
| operationParameters | |
object
| |
| operationMetrics | |
object
|
Its overwriting 43 files (Total 86) ?, is this optimal ?
Is there a way to improve performance by reducing the # of files ?
I am using
INSERT INTO <target> REPLACE USING (col)
SELECT <cols> FROM <table>
- Labels:
-
Delta Lake
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Saturday
Hi, the current way is not optimal. You can follow below
INSERT query ran with mostly 43 tasks, creating 43 output files. Since the Liquid clustered table has no organization (clusterBy "[]") - dates are randomly scattered across files.
- Partition table did a clean partition-level swap as all date='2026-06-02' data is isolated by partitioning
- In Liquid Clustering with Auto - Data is scattered - mostly both dates mixed across files. REPLACE USING scanned all 86 files, identified which 43 files contained date='2026-06-02' followed by rewrite. It's an expensive operation (numDeletedRows: 30M)
You can follow below to reduce file count during writes
1. Control in Configuration
SET spark.sql.shuffle.partitions = 16
INSERT INTO <target> REPLACE USING (BED)
SELECT <cols> FROM <table>2. Control in Repartition
INSERT INTO <target> REPLACE USING (BED)
SELECT /*+ REPARTITION(16) */ <col> FROM <table>;Partitioning is simpler and faster than liquid clustering in this case (With 2 date values and full replacements). Liquid clustering wins when you have high cardinality or multi-dimensional queries.