cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Selective overwrite on Partition and Liquid clustered tables

RGSLCA
New Contributor II

Hi,

I have created 2 identical tables but one is partitioned and the one is a Liquid Clustered with Auto Clustering.

I inserted 30M rows x 2 (60M) for two dates , date 1 = 2026-06-01 and date = 2026-06-02 , then I overwrite the date 2026-06-02 with a selective overwrite statement 

REPLACE USING (BED)

For partition table , history shows ,

operationParameters 
 

object

  • mode
    "Overwrite"
  • statsOnLoad
    "false"
  • partitionBy
    "[]"
operationMetrics 
 object
  • numFiles
    "43"
  • numRemovedFiles
    "43"
  • numRemovedBytes
    "6090798096"
  • numDeletionVectorsRemoved
    "0"
  • numOutputRows
    "30000000"
  • numOutputBytes
    "6101142740"

For Liquid Clustered table 

operationParameters 
 object
  • partitionBy
    "[]"
  • clusterBy
    "[]"
  • mode
    "Overwrite"
  • clusteringOnWriteStatus
    null
  • statsOnLoad
    "false"
  • replaceUsingCols
    "(BED)"
operationMetrics 
 object
  • numFiles
    "43"
  • numRemovedFiles
    "43"
  • numRemovedBytes
    "5779903887"
  • numCopiedRows
    "0"
  • numDeletionVectorsAdded
    "0"
  • numDeletedRows
    "30000000"
  • numDeletionVectorsRemoved
    "0"
  • numAddedChangeFiles
    "0"
  • numOutputRows
    "30000000"
  • numOutputBytes
    "5779715681"
     

Its overwriting 43 files (Total 86) ?, is this optimal ?

Is there a way to improve performance by reducing the # of files ?

I am using 

INSERT INTO <target>  REPLACE USING (col) 

SELECT <cols> FROM <table> 

 
1 REPLY 1

balajij8
Contributor III

Hi, the current way is not optimal. You can follow below

INSERT query ran with mostly 43 tasks, creating 43 output files. Since the Liquid clustered table has no organization (clusterBy "[]") - dates are randomly scattered across files.

  • Partition table did a clean partition-level swap as all date='2026-06-02' data is isolated by partitioning
  • In Liquid Clustering with Auto - Data is scattered - mostly both dates mixed across files. REPLACE USING scanned all 86 files, identified which 43 files contained date='2026-06-02' followed by rewrite. It's an expensive operation (numDeletedRows: 30M)

You can follow below to reduce file count during writes

1. Control in Configuration

SET spark.sql.shuffle.partitions = 16

INSERT INTO <target> REPLACE USING (BED)
SELECT <cols> FROM <table>

2. Control in Repartition

INSERT INTO <target> REPLACE USING (BED)
SELECT /*+ REPARTITION(16) */ <col> FROM <table>;

Partitioning is simpler and faster than liquid clustering in this case (With 2 date values and full replacements). Liquid clustering wins when you have high cardinality or multi-dimensional queries.