Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-22-2022 11:54 PM
File Layout Optimization - tips for efficient file layout
- Leverage Delta Lake for your workloads to unlock performance capabilities such as Z-Order Clustering.
- Use Z-Order (AWS | Azure | GCP) on high cardinality columns frequently used in filters. Z-Order works similar to indexing and colocates related data to dramatically reduce the amount of data that needs to be read, which can result in dramatic performance improvements.
- Use ANALYZE TABLE COMPUTE STATISTICS (AWS | Azure | GCP) to collect column statistics for columns frequently used in joins and filters. These column statistics are used by the Databricks cost-based optimizer and Adaptive Query Execution to ensure the most optimal query plan is chosen for your workload.
- Use PARTITIONED BY (AWS | Azure | GCP) to partition your largest tables by low cardinality columns frequently used in filters (this is often the date column). We do not recommend partitioning tables less than 1 TB, as this may lead to over-partitioning your dataset. With DBR 11.2+, Ingestion Time Clustering will preserve the natural date order of your ingested data, eliminating the need for partitioning or Z-ordering on tables under 1 TB and resulting in great out-of-the-box performance.
- For workloads with frequent MERGE and write operations, your workload may benefit from Autotune based on workload (AWS | Azure | GCP) rather than the default based on table size.
- Your workload may benefit from increased parallelism by setting the spark shuffle partitions to auto (AWS | Azure | GCP)