cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Databricks Liquid Cluster

techuser
New Contributor III

Hi,

Is it possible to convert existing delta table with partition having data to clustering? If so can you please suggest the steps required? I tried and searched but couldn't find any. Is it that liquid clustering can be done only for new Delta tables? Please help

1 ACCEPTED SOLUTION

Accepted Solutions

Hi @techuserCertainly! Converting an existing Delta table with partition to a Delta table with clustering is possible. Hereโ€™s a recommended approach:

  1. Backup Table A: Create a backup of your original table (Table A) as A_bkp. This ensures you have a copy of the data before making changes.

  2. Create a New Delta Table: Replace or drop the existing Table A and create a new Delta table with the desired clustering column (Column B). You can use the following SQL command to create the new table:

    CREATE TABLE Table_A
    USING DELTA
    CLUSTERED BY (Column_B)
    AS SELECT * FROM A_bkp;
    

    Replace Column_B with the actual name of the column you want to use for clustering.

  3. Insert Data: Insert data from the backup table (A_bkp) into the newly created Table A:

    INSERT INTO Table_A
    SELECT * FROM A_bkp;
    
  4. Cleanup: Finally, drop the backup table (A_bkp) and remove any associated files.

This approach allows you to transition from partitioning to clustering without losing data. Remember to thoroughly test the process in a non-production environment before applying it to your actual data. Additionally, ensure that the new clustering column is appropriate for your use case and query patte...

View solution in original post

10 REPLIES 10

Kaniz_Fatma
Community Manager
Community Manager

Hi @techuser, Certainly! Liquid clustering is a powerful feature in Delta Lake that allows you to optimize data layout without rewriting existing data. It replaces traditional table partitioning and ZORDER, making it easier to manage data and improve query performance. 

 

Let me guide you through the steps:

 

Enable Liquid Clustering:

  • Liquid clustering is available in Databricks Runtime 13.3 and above. Ensure youโ€™re using a compatible version.
  • When creating a new Delta table, add the CLUSTER BY phrase to the table creation statement. For example:CREATE TABLE my_table (col0 INT, col1 STRING) USING DELTA CLUSTER BY (col0);
  • You can also use a CTAS (Create Table As Select) statement to create a new table with clustering:CREATE EXTERNAL TABLE new_table CLUSTER BY (col0) LOCATION 'table_location' AS SELECT * FROM my_table;

Incrementally Cluster Data:

  • Once liquid clustering is enabled, you can run OPTIMIZE jobs as usual to incrementally cluster data.
  • Liquid clustering allows you to redefine clustering keys without rewriting existing data.

Changing Clustering Keys:

  • You can change clustering keys for an existing table using an ALTER TABLE command:ALTER TABLE my_table CLUSTER BY (new_col1, new_col2);
  • Subsequent OPTIMIZE and write operations will use the new clustering approach, but existing data wonโ€™t be rewritten.

Compatibility and Limitations:

  • Tables created with liquid clustering enabled have specific Delta table features and use specific writer and reader versions.
  • Note that tables with clustering enabled are not readable by Delta Lake clients that do not support all enabled Delta reader protocol table features.

Remember that liquid clustering is recommended for new Delta tables, but you can also apply it to existing tables. Happy clustering! ๐ŸŒŸ

 

For more details, refer to the official Databricks documentation on liquid clustering.

techuser
New Contributor III

@Kaniz_Fatma How this can be applied to existing delta table which is partitioned having data? Can you please suggest me the steps involved? The existing delta table is partitioned and hence the location files are all partitioned. So is it possible to convert this to cluster? In existing databricks documentation, its mentioned for NEW tables.

Hi @techuserCertainly! If you have an existing Delta table that is partitioned and you want to change the partition column, you can follow these steps:

  1. Read the Current Table:

    • First, read the existing Delta table using spark.read.table("mytable"), where "mytable" is the name of your table.
  2. Overwrite the Contents and Schema:

    • Next, overwrite the contents and schema of the table. Use the following code snippet:
      val input = spark.read.table("mytable")
      input.write.format("delta")
          .mode("overwrite")
          .option("overwriteSchema", "true")
          .partitionBy("new_partition_column") // Specify the new partition column
          .saveAsTable("mytable")
      
      Replace "new_partition_column" with the name of the new partition column you want to use.
  3. Explanation:

    • The above code reads the existing table, overwrites its contents and schema, and changes the partition column to the specified one.
    • Thereโ€™s no need to drop the table or remove files. The actual data files remain intact in the storage (e.g., S3), and only the metadata is updated.
  4. Additional Notes:

Remember to replace "mytable" and "new_partition_column" with your actual table name and desired partition column. Happy data wrangling! ๐Ÿš€

techuser
New Contributor III

Hi,

Sorry for again asking this. My requirement is not to change partition column for existing Delta table. My requirement is to change the existing delta table example Table A partitioned by Column 1, Column 2 to Table A cluster by Column 3.

Requirement is converting existing partitioned Delta table to Delta table with cluster with new column.

 

Hi @techuser, To convert an existing partitioned Delta table to a Delta table clustered by a new column, you can use the CONVERT TO DELTA command in Databricks SQL.

Hereโ€™s how you can achieve this:

  1. Read the existing table: First, read the current partitioned Delta table (letโ€™s call it โ€œmytableโ€) using Spark:

    val input = spark.read.table("mytable")
    
  2. Overwrite the contents and schema: Overwrite the table's contents and schema. Set the overwriteSchema option to true To ensure that the schema is updated:

    input.write
      .format("delta")
      .mode("overwrite")
      .option("overwriteSchema", "true")
    
  3. Cluster by the new column: Specify the new column (letโ€™s say โ€œcolBโ€) by which you want to cluster the table:

      .partitionBy("colB") // Specify the new column for clustering
      .saveAsTable("mytable") // Save the updated table
    

This process will convert your existing partitioned Delta table into a Delta table clustered by the specified new column. Remember to adjust the column names and table names according to your use case.

Keep in mind the following caveats:

  • Any file not tracked by Delta Lake becomes invisible and can be deleted during a VACUUM operation.
  • Avoid updating or appending data files during the conversion process.
  • After the table is converted, ensure that all writes go through Delta Lake.

Happy clustering! ๐ŸŒŸ๐Ÿ”

techuser
New Contributor III

While defining a new table which uses liquid cluster, we mention at the end as 'USING DELTA CLUSTER BY (Column1)'

As per above solution Point 3 Cluster by the new column, its mentioned as 

.partitionBy("colB")

How is it identified as CLUSTER? Because while creating a table we have CLUSTER BY and PARTITION BY as 2 different usage and that's how table is identified as CLUSTER or PARTITION. 

As per above explanation , does the DESCRIBE table show it as cluster?

 

Hi @techuser , When defining a new table that utilizes liquid clustering in Delta Lake, we specify the clustering key using the USING DELTA CLUSTER BY (Column1) syntax. 

However, it seems there might be some confusion regarding the distinction between clustering and partitioning.

 

Letโ€™s break it down:

 

Clustering:

  • Liquid clustering is a feature introduced by Databricks for Delta tables. It replaces traditional table partitioning and ZORDER to simplify data layout decisions and optimize query performance.
  • With liquid clustering, you can redefine clustering keys without rewriting existing data. This flexibility allows the data layout to evolve alongside your analytic needs over time.
  • Liquid clustering is particularly beneficial for scenarios such as:
    • Tables filtered by high cardinality columns.
    • Tables with skewed data distribution.
    • Rapidly growing tables that require maintenance and tuning.
    • Tables with concurrent write requirements.
    • Tables with changing access patterns.
    • Situations where a typical partition key could result in too many or too few partitions.
  • When creating a table, you enable liquid clustering by adding the CLUSTER BY phrase to the table creation statement. For example:CREATE TABLE my_table (col0 int, col1 string) USING DELTA CLUSTER BY (col0);
  • Liquid clustering is not compatible with traditional partitioning or ZORDER. It requires that the Databricks client manages all layout and optimization operations for data in the table.
  • Once enabled, you can run OPTIMIZE jobs to incrementally cluster data.

Partitioning:

  • Partitioning is a different concept. It involves dividing data into smaller subsets (partitions) based on specific column values.
  • When you use PARTITION BY during table creation, youโ€™re specifying how the data should be physically organized into partitions. Each partition contains rows with the same partition key value.
  • Partitioning is useful for improving query performance by reducing the amount of data scanned during queries. It allows for efficient pruning of irrelevant partitions.
  • Unlike liquid clustering, partitioning is a static decision made during table creation and cannot be easily changed without rewriting data.

DESCRIBE Table:

  • When you describe a Delta table using the DESCRIBE command, it provides information about the tableโ€™s schema, columns, and other metadata.
  • Unfortunately, the DESCRIBE output does not explicitly indicate whether the table uses liquid clustering. It primarily focuses on column details, data types, and other structural information.

In summary, while the DESCRIBE command wonโ€™t directly show whether a table is clustered using liquid clustering, you can infer it from the table creation statement and the presence of the CLUSTER BY clause. Liquid clustering enhances query performance and provides flexibility, making it a recommended choice for new Delta tables.

techuser
New Contributor III

Thank you for the response!

DESCRIBE table gives the column and datatype details and also the columns of PARTITION and CLUSTER. If cluster is used it mentions as Clustering Information and mention the columns used likewise for partition also.

So back to my previous question. Is there a way to Convert existing Delta Table with partition to a Delta Table with cluster.

1. Table A -- Partition column A

2. Take back up of Table A as A_bkp

3. Replace or Drop/Create Table A with Cluster Column B

4. INSERT TABLE A AS SELECT * FROM A_BKP

5. DROP A_BKP, remove the files associated

Is this a good approach?

 

Hi @techuserCertainly! Converting an existing Delta table with partition to a Delta table with clustering is possible. Hereโ€™s a recommended approach:

  1. Backup Table A: Create a backup of your original table (Table A) as A_bkp. This ensures you have a copy of the data before making changes.

  2. Create a New Delta Table: Replace or drop the existing Table A and create a new Delta table with the desired clustering column (Column B). You can use the following SQL command to create the new table:

    CREATE TABLE Table_A
    USING DELTA
    CLUSTERED BY (Column_B)
    AS SELECT * FROM A_bkp;
    

    Replace Column_B with the actual name of the column you want to use for clustering.

  3. Insert Data: Insert data from the backup table (A_bkp) into the newly created Table A:

    INSERT INTO Table_A
    SELECT * FROM A_bkp;
    
  4. Cleanup: Finally, drop the backup table (A_bkp) and remove any associated files.

This approach allows you to transition from partitioning to clustering without losing data. Remember to thoroughly test the process in a non-production environment before applying it to your actual data. Additionally, ensure that the new clustering column is appropriate for your use case and query patte...

Raja_Databricks
New Contributor III

Does Liquid Clustering accepts Merge or How Upsert can be done efficiently with Liquid clustered delta table

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!