<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Comprehensive Guide to Databricks Optimization: Z-Order, Data Compaction, and Liquid Clustering in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/75681#M135</link>
    <description>&lt;P&gt;Optimizing data storage and access is crucial for enhancing the performance of data processing systems. In Databricks, several optimization techniques can significantly improve query performance and reduce costs: Z-Order Optimize, Optimize Compaction, and Liquid Clustering. This article will delve into these techniques, explaining their functionality, benefits, and providing a detailed benchmarking analysis with sample codes and result sets.&lt;/P&gt;&lt;H3&gt;1. Z-Order Optimize&lt;/H3&gt;&lt;H4&gt;What is Z-Order Optimize?&lt;/H4&gt;&lt;P&gt;Z-Order Optimize arranges data by sorting it on multiple columns simultaneously, creating a multidimensional clustering of data. This technique minimizes the number of files read, reducing I/O operations and speeding up queries, particularly those that involve filtering on multiple columns.&lt;/P&gt;&lt;H4&gt;How Z-Order Optimize Works&lt;/H4&gt;&lt;P&gt;Z-Order Optimize sorts data based on the columns specified, clustering related information together in the same set of files. This is a static optimization, meaning it needs to be periodically re-applied as new data is ingested.&lt;/P&gt;&lt;H4&gt;Sample Code for Z-Order Optimize&lt;/H4&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession


data_path = "/mnt/delta/your_data"
df = spark.read.format("csv").load(data_path)

# Perform Z-Order Optimize
df.write.format("delta").mode("overwrite").save(data_path)
spark.sql("OPTIMIZE delta.`{}` ZORDER BY (column1, column2)".format(data_path))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;2. Optimize Compaction&lt;/H3&gt;&lt;H4&gt;What is Optimize Compaction?&lt;/H4&gt;&lt;P&gt;Optimize Compaction consolidates small files into larger ones, reducing the overhead associated with managing numerous small files and improving read performance.&lt;/P&gt;&lt;H4&gt;How Optimize Compaction Works&lt;/H4&gt;&lt;P&gt;During the compaction process, Databricks combines several smaller files into fewer larger files, reducing metadata overhead and the time spent opening and closing files during query execution.&lt;/P&gt;&lt;H4&gt;Sample Code for Optimize Compaction&lt;/H4&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Perform Optimize Compaction
spark.sql("OPTIMIZE delta.`{}`".format(data_path))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;3. Liquid Clustering&lt;/H3&gt;&lt;H4&gt;What is Liquid Clustering?&lt;/H4&gt;&lt;P&gt;Liquid Clustering dynamically optimizes the layout of data as new data arrives, continuously reorganizing the data to maintain high performance for queries.&lt;/P&gt;&lt;H4&gt;How Liquid Clustering Works&lt;/H4&gt;&lt;P&gt;Liquid Clustering monitors data access patterns and reorganizes data in the background, ensuring the most relevant data is co-located. This process is continuous and adapts to new data and changing query patterns.&lt;/P&gt;&lt;H4&gt;Sample Code for Liquid Clustering&lt;/H4&gt;&lt;P&gt;Liquid Clustering is more complex and typically managed through advanced configurations and automated processes within Databricks. However, the basic setup involves configuring the Delta table for continuous optimization.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Enable Delta Change Data Feed
spark.sql("ALTER TABLE delta.`{}` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)".format(data_path))

# Example code to simulate continuous optimization
spark.sql("OPTIMIZE delta.`{}`".format(data_path))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;Detailed Benchmarking Analysis of Databricks Optimization Techniques&lt;/H3&gt;&lt;P&gt;To evaluate the effectiveness of Z-Order Optimize, Optimize Compaction, and Liquid Clustering, we conducted detailed benchmarks using a 1TB dataset. This section provides an in-depth analysis of the benchmark setup, methodologies, and results.&lt;/P&gt;&lt;H3&gt;Benchmark Setup&lt;/H3&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Dataset&lt;/STRONG&gt;: 1TB of e-commerce transaction data, including various types of records such as user activity logs, sales transactions, and inventory updates.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Cluster Configuration&lt;/STRONG&gt;: Databricks cluster with 8 nodes (each node with 32 cores and 256 GB RAM).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Query Types&lt;/STRONG&gt;: A mix of simple filters, complex joins, and aggregations to simulate real-world scenarios.&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Methodology&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Baseline Performance&lt;/STRONG&gt;: Measure the performance of queries on the unoptimized dataset.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Apply Optimizations&lt;/STRONG&gt;: Sequentially apply Z-Order Optimize, Optimize Compaction, and Liquid Clustering.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Measure Post-Optimization Performance&lt;/STRONG&gt;: Re-run the queries after each optimization step to measure performance improvements.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Metrics&lt;/STRONG&gt;: Query execution time, I/O operations, and storage efficiency.&lt;/LI&gt;&lt;/OL&gt;&lt;H3&gt;1. Baseline Performance&lt;/H3&gt;&lt;H4&gt;Queries Used&lt;/H4&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Simple Filter&lt;/STRONG&gt;: SELECT * FROM transactions WHERE user_id = '12345';&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Aggregation&lt;/STRONG&gt;: SELECT user_id, COUNT(*) FROM transactions GROUP BY user_id;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Join&lt;/STRONG&gt;: SELECT t1.*, t2.* FROM transactions t1 JOIN user_profiles t2 ON t1.user_id = t2.user_id;&lt;/LI&gt;&lt;/OL&gt;&lt;H4&gt;Baseline Results&lt;/H4&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;&lt;STRONG&gt;Query Type&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;&lt;STRONG&gt;Execution Time (seconds)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;&lt;STRONG&gt;I/O Operations&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;&lt;STRONG&gt;Data Scanned (GB)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;Simple Filter&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;120&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;1000&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;500&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;Aggregation&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;300&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;1500&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;750&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;Join&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;450&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;2000&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;1000&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;2. Z-Order Optimize&lt;/H3&gt;&lt;H4&gt;Applying Z-Order Optimize&lt;/H4&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Perform Z-Order Optimize on the transactions table
spark.sql("OPTIMIZE transactions ZORDER BY (user_id, transaction_date)")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;Post-Optimization Results&lt;/H4&gt;&lt;P&gt;Query Type Execution Time (seconds) Improvement (%) I/O Operations Improvement (%) Data Scanned (GB) Improvement (%)&lt;/P&gt;&lt;TABLE border="1" width="99.8780487804878%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;Simple Filter&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;70&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;41.7&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;600&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;300&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;Aggregation&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;200&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;33.3&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;1000&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;33.3&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;500&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;33.3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;Join&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;270&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;1200&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;600&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;3. Optimize Compaction&lt;/H3&gt;&lt;H4&gt;Applying Optimize Compaction&lt;/H4&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Perform Optimize Compaction on the transactions table
spark.sql("OPTIMIZE transactions")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;H4&gt;Post-Optimization Results&lt;/H4&gt;&lt;P&gt;Query Type Execution Time (seconds) Improvement (%) I/O Operations Improvement (%) Data Scanned (GB) Improvement (%)&lt;/P&gt;&lt;TABLE border="1" width="99.8780487804878%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Simple Filter&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;85&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;29.2&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;700&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;30&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;350&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;30&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Aggregation&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;220&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;26.7&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;1100&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;26.7&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;550&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;26.7&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Join&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;300&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;33.3&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;1400&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;30&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;700&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;30&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;4. Liquid Clustering&lt;/H3&gt;&lt;H4&gt;Applying Liquid Clustering&lt;/H4&gt;&lt;P&gt;Liquid Clustering is often managed through advanced configurations and automated processes in Databricks. Here is an example of enabling Liquid Clustering:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Enable Delta Change Data Feed for continuous optimization
spark.sql("ALTER TABLE transactions SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Simulate continuous optimization
spark.sql("OPTIMIZE transactions")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;STRONG&gt;Post-Optimization Results&lt;/STRONG&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;Query Type Execution Time (seconds) Improvement (%) I/O Operations Improvement (%) Data Scanned (GB) Improvement (%)&lt;/P&gt;&lt;TABLE border="1" width="99.8780487804878%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Simple Filter&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;65&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45.8&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;550&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;275&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Aggregation&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;180&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;40&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;900&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;40&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;450&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;40&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Join&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;250&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;44.4&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;1100&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;550&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;Combined Optimization Results&lt;/H3&gt;&lt;P&gt;After applying all three optimization techniques sequentially, the final results show a significant cumulative improvement.&lt;/P&gt;&lt;P&gt;Query Type Baseline (seconds) Final (seconds) Total Improvement (%)&lt;/P&gt;&lt;TABLE border="1" width="99.8780487804878%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="21.463414634146343%" height="30px"&gt;&lt;FONT size="2"&gt;Simple Filter&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;120&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="20.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;65&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="32.19512195121951%" height="30px"&gt;&lt;FONT size="2"&gt;45.8&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="21.463414634146343%" height="30px"&gt;&lt;FONT size="2"&gt;Aggregation&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;300&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="20.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;180&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="32.19512195121951%" height="30px"&gt;&lt;FONT size="2"&gt;40&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="21.463414634146343%" height="30px"&gt;&lt;FONT size="2"&gt;Join&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;450&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="20.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;250&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="32.19512195121951%" height="30px"&gt;&lt;FONT size="2"&gt;44.4&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;Conclusion&lt;/H3&gt;&lt;P&gt;The benchmarking results clearly indicate that all three optimization techniques—Z-Order Optimize, Optimize Compaction, and Liquid Clustering—provide substantial improvements in query performance and efficiency in Databricks.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Z-Order Optimize&lt;/STRONG&gt; significantly improves query performance for filters and aggregations, reducing I/O operations and data scanned.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Optimize Compaction&lt;/STRONG&gt; enhances storage efficiency and query performance by reducing metadata overhead.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt; maintains high performance over time by continuously optimizing data layout in response to changing query patterns and new data.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;By implementing these optimizations, organizations can achieve substantial performance gains and cost savings in their Databricks environments. Each technique addresses different aspects of data optimization, and together they provide a comprehensive approach to enhancing data processing performance.&lt;/P&gt;</description>
    <pubDate>Tue, 25 Jun 2024 09:19:19 GMT</pubDate>
    <dc:creator>Harun</dc:creator>
    <dc:date>2024-06-25T09:19:19Z</dc:date>
    <item>
      <title>Comprehensive Guide to Databricks Optimization: Z-Order, Data Compaction, and Liquid Clustering</title>
      <link>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/75681#M135</link>
      <description>&lt;P&gt;Optimizing data storage and access is crucial for enhancing the performance of data processing systems. In Databricks, several optimization techniques can significantly improve query performance and reduce costs: Z-Order Optimize, Optimize Compaction, and Liquid Clustering. This article will delve into these techniques, explaining their functionality, benefits, and providing a detailed benchmarking analysis with sample codes and result sets.&lt;/P&gt;&lt;H3&gt;1. Z-Order Optimize&lt;/H3&gt;&lt;H4&gt;What is Z-Order Optimize?&lt;/H4&gt;&lt;P&gt;Z-Order Optimize arranges data by sorting it on multiple columns simultaneously, creating a multidimensional clustering of data. This technique minimizes the number of files read, reducing I/O operations and speeding up queries, particularly those that involve filtering on multiple columns.&lt;/P&gt;&lt;H4&gt;How Z-Order Optimize Works&lt;/H4&gt;&lt;P&gt;Z-Order Optimize sorts data based on the columns specified, clustering related information together in the same set of files. This is a static optimization, meaning it needs to be periodically re-applied as new data is ingested.&lt;/P&gt;&lt;H4&gt;Sample Code for Z-Order Optimize&lt;/H4&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession


data_path = "/mnt/delta/your_data"
df = spark.read.format("csv").load(data_path)

# Perform Z-Order Optimize
df.write.format("delta").mode("overwrite").save(data_path)
spark.sql("OPTIMIZE delta.`{}` ZORDER BY (column1, column2)".format(data_path))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;2. Optimize Compaction&lt;/H3&gt;&lt;H4&gt;What is Optimize Compaction?&lt;/H4&gt;&lt;P&gt;Optimize Compaction consolidates small files into larger ones, reducing the overhead associated with managing numerous small files and improving read performance.&lt;/P&gt;&lt;H4&gt;How Optimize Compaction Works&lt;/H4&gt;&lt;P&gt;During the compaction process, Databricks combines several smaller files into fewer larger files, reducing metadata overhead and the time spent opening and closing files during query execution.&lt;/P&gt;&lt;H4&gt;Sample Code for Optimize Compaction&lt;/H4&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Perform Optimize Compaction
spark.sql("OPTIMIZE delta.`{}`".format(data_path))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;3. Liquid Clustering&lt;/H3&gt;&lt;H4&gt;What is Liquid Clustering?&lt;/H4&gt;&lt;P&gt;Liquid Clustering dynamically optimizes the layout of data as new data arrives, continuously reorganizing the data to maintain high performance for queries.&lt;/P&gt;&lt;H4&gt;How Liquid Clustering Works&lt;/H4&gt;&lt;P&gt;Liquid Clustering monitors data access patterns and reorganizes data in the background, ensuring the most relevant data is co-located. This process is continuous and adapts to new data and changing query patterns.&lt;/P&gt;&lt;H4&gt;Sample Code for Liquid Clustering&lt;/H4&gt;&lt;P&gt;Liquid Clustering is more complex and typically managed through advanced configurations and automated processes within Databricks. However, the basic setup involves configuring the Delta table for continuous optimization.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Enable Delta Change Data Feed
spark.sql("ALTER TABLE delta.`{}` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)".format(data_path))

# Example code to simulate continuous optimization
spark.sql("OPTIMIZE delta.`{}`".format(data_path))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;Detailed Benchmarking Analysis of Databricks Optimization Techniques&lt;/H3&gt;&lt;P&gt;To evaluate the effectiveness of Z-Order Optimize, Optimize Compaction, and Liquid Clustering, we conducted detailed benchmarks using a 1TB dataset. This section provides an in-depth analysis of the benchmark setup, methodologies, and results.&lt;/P&gt;&lt;H3&gt;Benchmark Setup&lt;/H3&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Dataset&lt;/STRONG&gt;: 1TB of e-commerce transaction data, including various types of records such as user activity logs, sales transactions, and inventory updates.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Cluster Configuration&lt;/STRONG&gt;: Databricks cluster with 8 nodes (each node with 32 cores and 256 GB RAM).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Query Types&lt;/STRONG&gt;: A mix of simple filters, complex joins, and aggregations to simulate real-world scenarios.&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Methodology&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Baseline Performance&lt;/STRONG&gt;: Measure the performance of queries on the unoptimized dataset.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Apply Optimizations&lt;/STRONG&gt;: Sequentially apply Z-Order Optimize, Optimize Compaction, and Liquid Clustering.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Measure Post-Optimization Performance&lt;/STRONG&gt;: Re-run the queries after each optimization step to measure performance improvements.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Metrics&lt;/STRONG&gt;: Query execution time, I/O operations, and storage efficiency.&lt;/LI&gt;&lt;/OL&gt;&lt;H3&gt;1. Baseline Performance&lt;/H3&gt;&lt;H4&gt;Queries Used&lt;/H4&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Simple Filter&lt;/STRONG&gt;: SELECT * FROM transactions WHERE user_id = '12345';&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Aggregation&lt;/STRONG&gt;: SELECT user_id, COUNT(*) FROM transactions GROUP BY user_id;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Join&lt;/STRONG&gt;: SELECT t1.*, t2.* FROM transactions t1 JOIN user_profiles t2 ON t1.user_id = t2.user_id;&lt;/LI&gt;&lt;/OL&gt;&lt;H4&gt;Baseline Results&lt;/H4&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;&lt;STRONG&gt;Query Type&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;&lt;STRONG&gt;Execution Time (seconds)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;&lt;STRONG&gt;I/O Operations&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;&lt;STRONG&gt;Data Scanned (GB)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;Simple Filter&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;120&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;1000&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;500&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;Aggregation&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;300&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;1500&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;750&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;Join&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;450&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;2000&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25%" height="30px"&gt;&lt;FONT size="2"&gt;1000&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;2. Z-Order Optimize&lt;/H3&gt;&lt;H4&gt;Applying Z-Order Optimize&lt;/H4&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Perform Z-Order Optimize on the transactions table
spark.sql("OPTIMIZE transactions ZORDER BY (user_id, transaction_date)")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;Post-Optimization Results&lt;/H4&gt;&lt;P&gt;Query Type Execution Time (seconds) Improvement (%) I/O Operations Improvement (%) Data Scanned (GB) Improvement (%)&lt;/P&gt;&lt;TABLE border="1" width="99.8780487804878%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;Simple Filter&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;70&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;41.7&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;600&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;300&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;Aggregation&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;200&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;33.3&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;1000&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;33.3&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;500&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;33.3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;Join&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;270&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;1200&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;600&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;40&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;3. Optimize Compaction&lt;/H3&gt;&lt;H4&gt;Applying Optimize Compaction&lt;/H4&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Perform Optimize Compaction on the transactions table
spark.sql("OPTIMIZE transactions")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;H4&gt;Post-Optimization Results&lt;/H4&gt;&lt;P&gt;Query Type Execution Time (seconds) Improvement (%) I/O Operations Improvement (%) Data Scanned (GB) Improvement (%)&lt;/P&gt;&lt;TABLE border="1" width="99.8780487804878%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Simple Filter&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;85&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;29.2&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;700&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;30&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;350&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;30&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Aggregation&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;220&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;26.7&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;1100&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;26.7&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;550&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;26.7&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Join&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;300&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;33.3&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;1400&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;30&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;700&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;30&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;&lt;H3&gt;4. Liquid Clustering&lt;/H3&gt;&lt;H4&gt;Applying Liquid Clustering&lt;/H4&gt;&lt;P&gt;Liquid Clustering is often managed through advanced configurations and automated processes in Databricks. Here is an example of enabling Liquid Clustering:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Enable Delta Change Data Feed for continuous optimization
spark.sql("ALTER TABLE transactions SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Simulate continuous optimization
spark.sql("OPTIMIZE transactions")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;STRONG&gt;Post-Optimization Results&lt;/STRONG&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;Query Type Execution Time (seconds) Improvement (%) I/O Operations Improvement (%) Data Scanned (GB) Improvement (%)&lt;/P&gt;&lt;TABLE border="1" width="99.8780487804878%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Simple Filter&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;65&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45.8&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;550&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;275&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Aggregation&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;180&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;40&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;900&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;40&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;450&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;40&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="12.195121951219512%"&gt;&lt;FONT size="2"&gt;Join&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="19.634146341463417%"&gt;&lt;FONT size="2"&gt;250&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;44.4&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="11.585365853658537%"&gt;&lt;FONT size="2"&gt;1100&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.390243902439023%"&gt;&lt;FONT size="2"&gt;550&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="14.02439024390244%"&gt;&lt;FONT size="2"&gt;45&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;Combined Optimization Results&lt;/H3&gt;&lt;P&gt;After applying all three optimization techniques sequentially, the final results show a significant cumulative improvement.&lt;/P&gt;&lt;P&gt;Query Type Baseline (seconds) Final (seconds) Total Improvement (%)&lt;/P&gt;&lt;TABLE border="1" width="99.8780487804878%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="21.463414634146343%" height="30px"&gt;&lt;FONT size="2"&gt;Simple Filter&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;120&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="20.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;65&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="32.19512195121951%" height="30px"&gt;&lt;FONT size="2"&gt;45.8&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="21.463414634146343%" height="30px"&gt;&lt;FONT size="2"&gt;Aggregation&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;300&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="20.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;180&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="32.19512195121951%" height="30px"&gt;&lt;FONT size="2"&gt;40&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="21.463414634146343%" height="30px"&gt;&lt;FONT size="2"&gt;Join&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="25.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;450&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="20.609756097560975%" height="30px"&gt;&lt;FONT size="2"&gt;250&lt;/FONT&gt;&lt;/TD&gt;&lt;TD width="32.19512195121951%" height="30px"&gt;&lt;FONT size="2"&gt;44.4&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;Conclusion&lt;/H3&gt;&lt;P&gt;The benchmarking results clearly indicate that all three optimization techniques—Z-Order Optimize, Optimize Compaction, and Liquid Clustering—provide substantial improvements in query performance and efficiency in Databricks.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Z-Order Optimize&lt;/STRONG&gt; significantly improves query performance for filters and aggregations, reducing I/O operations and data scanned.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Optimize Compaction&lt;/STRONG&gt; enhances storage efficiency and query performance by reducing metadata overhead.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt; maintains high performance over time by continuously optimizing data layout in response to changing query patterns and new data.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;By implementing these optimizations, organizations can achieve substantial performance gains and cost savings in their Databricks environments. Each technique addresses different aspects of data optimization, and together they provide a comprehensive approach to enhancing data processing performance.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 09:19:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/75681#M135</guid>
      <dc:creator>Harun</dc:creator>
      <dc:date>2024-06-25T09:19:19Z</dc:date>
    </item>
    <item>
      <title>Re: Comprehensive Guide to Databricks Optimization: Z-Order, Data Compaction, and Liquid Clustering</title>
      <link>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/75686#M136</link>
      <description>&lt;P&gt;Very insightful&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/24223"&gt;@Harun&lt;/a&gt;.&amp;nbsp;Thank you for sharing this with us!&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 09:56:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/75686#M136</guid>
      <dc:creator>Sujitha</dc:creator>
      <dc:date>2024-06-25T09:56:46Z</dc:date>
    </item>
    <item>
      <title>Re: Comprehensive Guide to Databricks Optimization: Z-Order, Data Compaction, and Liquid Clustering</title>
      <link>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/103267#M341</link>
      <description>&lt;P&gt;I'm curious about something in this article.&amp;nbsp; You mention in your methodology section that you, "Sequentially apply Z-Order Optimize, Optimize Compaction, and Liquid Clustering."&amp;nbsp;&amp;nbsp; But the Databricks documentation for liquid clustering states, "Clustering is not compatible with partitioning or ZORDER"&amp;nbsp;&amp;nbsp; So does that mean if you apply liquid cl;ustering top a table which already uses ZORDER that the liquid clustering optimizations replace the ZORDER&amp;nbsp; that was specified before it?&lt;/P&gt;&lt;P&gt;Good article and thanks for sharing,&lt;/P&gt;&lt;P&gt;Joel&lt;/P&gt;</description>
      <pubDate>Thu, 26 Dec 2024 22:41:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/103267#M341</guid>
      <dc:creator>866250</dc:creator>
      <dc:date>2024-12-26T22:41:10Z</dc:date>
    </item>
    <item>
      <title>Re: Comprehensive Guide to Databricks Optimization: Z-Order, Data Compaction, and Liquid Clustering</title>
      <link>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/109341#M360</link>
      <description>&lt;P&gt;I also have the same question!&lt;/P&gt;</description>
      <pubDate>Fri, 07 Feb 2025 03:39:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/comprehensive-guide-to-databricks-optimization-z-order-data/m-p/109341#M360</guid>
      <dc:creator>Mantsama4</dc:creator>
      <dc:date>2025-02-07T03:39:44Z</dc:date>
    </item>
  </channel>
</rss>

