
Delta Lake checkpoint storage concept

User16826994223
Honored Contributor III

In which format are checkpoints stored, and how do they help Delta Lake improve performance?

1 ACCEPTED SOLUTION

Accepted Solutions

User16826994223
Honored Contributor III

Delta Lake writes checkpoints as an aggregate state of a Delta table every 10 commits. These checkpoints serve as the starting point to compute the latest state of the table. Without checkpoints, Delta Lake would have to read a large collection of JSON files (“delta” files) representing commits to the transaction log to compute the state of a table. In addition, the column-level statistics Delta Lake uses to perform data skipping are stored in the checkpoint.
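To make this concrete, here is a minimal sketch of what the transaction log looks like on storage. The table path is hypothetical; on Databricks the log lives under the table's location in DBFS or cloud storage. Checkpoints are Parquet files written alongside the JSON commit files, and a `_last_checkpoint` file points readers at the most recent one.

```python
import os

# Hypothetical local table path; substitute your table's storage location.
delta_log = "/tmp/my_delta_table/_delta_log"

for name in sorted(os.listdir(delta_log)):
    if name.endswith(".json"):
        print("commit:    ", name)  # one JSON file per commit, e.g. ...011.json
    elif name.endswith(".checkpoint.parquet"):
        print("checkpoint:", name)  # aggregate table state, e.g. ...010.checkpoint.parquet
    elif name == "_last_checkpoint":
        print("pointer:   ", name)  # tells readers which checkpoint to start from
```

A reader starts from the checkpoint named in `_last_checkpoint` and replays only the JSON commits written after it, instead of replaying the log from version 0.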


3 REPLIES

User16826994223
Honored Contributor III

In Databricks Runtime 7.2 and below, column-level statistics are stored in Delta Lake checkpoints as a JSON column.

In Databricks Runtime 7.3 LTS and above, column-level statistics are stored as a struct. The struct format makes Delta Lake reads much faster, because:

  • Delta Lake doesn’t perform expensive JSON parsing to obtain column-level statistics.
  • Parquet column pruning capabilities significantly reduce the I/O required to read the statistics for a column.

The struct format enables a collection of optimizations that reduce the overhead of Delta Lake read operations from seconds to tens of milliseconds, which significantly reduces the latency for short queries.
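If you want to see the two formats side by side, a hedged sketch: read a checkpoint Parquet file directly and inspect its schema. The path is hypothetical. In the checkpoint, JSON statistics appear as the string column `add.stats`, while the struct form appears as `add.stats_parsed`; the table properties `delta.checkpoint.writeStatsAsJson` and `delta.checkpoint.writeStatsAsStruct` control which of the two is written.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical checkpoint file; pick a real one from your table's _delta_log.
checkpoint = spark.read.parquet(
    "/tmp/my_delta_table/_delta_log/00000000000000000010.checkpoint.parquet"
)

# JSON stats: add.stats is a plain string that must be parsed per file.
# Struct stats: add.stats_parsed is a nested struct, so Parquet column
# pruning can read, say, only minValues for a single column.
checkpoint.printSchema()
```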

Anand_Ladda
Honored Contributor II

Great points above on how checkpointing helps with performance. In addition, Delta Lake also provides other data organization strategies, such as compaction and Z-ordering, to help with both the read and write performance of Delta tables. Additional details here: https://docs.databricks.com/delta/optimizations/file-mgmt.html
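For completeness, a short sketch of those two strategies, assuming a hypothetical Delta table named `events` with an `eventType` column (both are documented Delta Lake SQL commands on Databricks):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compaction: coalesce many small files into fewer, larger ones,
# reducing per-file overhead on reads.
spark.sql("OPTIMIZE events")

# Z-ordering: co-locate related values in the same files so that the
# column-level statistics in the checkpoint skip more files at query time.
spark.sql("OPTIMIZE events ZORDER BY (eventType)")
```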
