isaac_gritz
Databricks Employee
Databricks Employee

File Layout Optimization - tips for efficient file layout

  1. Leverage Delta Lake for your workloads to unlock performance capabilities such as Z-Order Clustering.
  2. Use Z-Order (AWS | Azure | GCP) on high cardinality columns frequently used in filters. Z-Order works similar to indexing and colocates related data to dramatically reduce the amount of data that needs to be read, which can result in dramatic performance improvements.
  3. Use ANALYZE TABLE COMPUTE STATISTICS (AWS | Azure | GCP) to collect column statistics for columns frequently used in joins and filters. These column statistics are used by the Databricks cost-based optimizer and Adaptive Query Execution to ensure the most optimal query plan is chosen for your workload.
  4. Use PARTITIONED BY (AWS | Azure | GCP) to partition your largest tables by low cardinality columns frequently used in filters (this is often the date column). We do not recommend partitioning tables less than 1 TB, as this may lead to over-partitioning your dataset. With DBR 11.2+, Ingestion Time Clustering will preserve the natural date order of your ingested data, eliminating the need for partitioning or Z-ordering on tables under 1 TB and resulting in great out-of-the-box performance.
  5. For workloads with frequent MERGE and write operations, your workload may benefit from Autotune based on workload (AWS | Azure | GCP) rather than the default based on table size.
  6. Your workload may benefit from increased parallelism by setting the spark shuffle partitions to auto (AWS | Azure | GCP)