Databricks Community

RGSLCA · 4 weeks ago

Hi,

I have created 2 identical tables but one is partitioned and the one is a Liquid Clustered with Auto Clustering.

I inserted 30M rows x 2 (60M) for two dates , date 1 = 2026-06-01 and date = 2026-06-02 , then I overwrite the date 2026-06-02 with a selective overwrite statement

REPLACE USING (BED)

For partition table , history shows ,

operationParameters
	object mode: "Overwrite" statsOnLoad: "false" partitionBy: "[]"
operationMetrics
	object numFiles: "43" numRemovedFiles: "43" numRemovedBytes: "6090798096" numDeletionVectorsRemoved: "0" numOutputRows: "30000000" numOutputBytes: "6101142740"

For Liquid Clustered table

operationParameters
	object partitionBy: "[]" clusterBy: "[]" mode: "Overwrite" clusteringOnWriteStatus: null statsOnLoad: "false" replaceUsingCols: "(BED)"
operationMetrics
	object numFiles: "43" numRemovedFiles: "43" numRemovedBytes: "5779903887" numCopiedRows: "0" numDeletionVectorsAdded: "0" numDeletedRows: "30000000" numDeletionVectorsRemoved: "0" numAddedChangeFiles: "0" numOutputRows: "30000000" numOutputBytes: "5779715681"

Its overwriting 43 files (Total 86) ?, is this optimal ?

Is there a way to improve performance by reducing the # of files ?

I am using

INSERT INTO <target> REPLACE USING (col)

SELECT <cols> FROM <table>

balajij8 · 3 weeks ago

Hi, the current way is not optimal. You can follow below

INSERT query ran with mostly 43 tasks, creating 43 output files. Since the Liquid clustered table has no organization (clusterBy "[]") - dates are randomly scattered across files.

Partition table did a clean partition-level swap as all date='2026-06-02' data is isolated by partitioning
In Liquid Clustering with Auto - Data is scattered - mostly both dates mixed across files. REPLACE USING scanned all 86 files, identified which 43 files contained date='2026-06-02' followed by rewrite. It's an expensive operation (numDeletedRows: 30M)

You can follow below to reduce file count during writes

1. Control in Configuration

SET spark.sql.shuffle.partitions = 16

INSERT INTO <target> REPLACE USING (BED)
SELECT <cols> FROM <table>

2. Control in Repartition

INSERT INTO <target> REPLACE USING (BED)
SELECT /*+ REPARTITION(16) */ <col> FROM <table>;

Partitioning is simpler and faster than liquid clustering in this case (With 2 date values and full replacements). Liquid clustering wins when you have high cardinality or multi-dimensional queries.

Databricks Community

Selective overwrite on Partition and Liquid clustered tables

📌‌ Complete Your Profile – Help Others Get to Know You

The Next Wave of Enterprise AI | Webinar

Databricks Community Fellows – June 2026 Recap

Solution Accelerator Series | Building Common Sense Product Recommendations With LLMs

🌟 Community Pulse: Your Weekly Roundup! June 29 – July 05, 2026