
Delta Lake checkpoint storage concept

User16826994223
Honored Contributor III

In which format are checkpoints stored, and how do they help Delta Lake improve performance?

1 ACCEPTED SOLUTION

Accepted Solutions

User16826994223
Honored Contributor III

Delta Lake writes checkpoints as an aggregate state of a Delta table every 10 commits. These checkpoints serve as the starting point to compute the latest state of the table. Without checkpoints, Delta Lake would have to read a large collection of JSON files (“delta” files) representing commits to the transaction log to compute the state of a table. In addition, the column-level statistics Delta Lake uses to perform data skipping are stored in the checkpoint.
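To make this concrete, here is a minimal sketch of what the transaction log looks like on storage. The table path is hypothetical; on Databricks the log lives under the table's location in DBFS or cloud storage. Checkpoints are Parquet files written alongside the JSON commit files, and a `_last_checkpoint` file points readers at the most recent one.

```python
import os

# Hypothetical local table path; substitute your table's storage location.
delta_log = "/tmp/my_delta_table/_delta_log"

for name in sorted(os.listdir(delta_log)):
    if name.endswith(".json"):
        print("commit:    ", name)  # one JSON file per commit, e.g. ...011.json
    elif name.endswith(".checkpoint.parquet"):
        print("checkpoint:", name)  # aggregate table state, e.g. ...010.checkpoint.parquet
    elif name == "_last_checkpoint":
        print("pointer:   ", name)  # tells readers which checkpoint to start from
```

A reader starts from the checkpoint named in `_last_checkpoint` and replays only the JSON commits written after it, instead of replaying the log from version 0.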


3 REPLIES

User16826994223
Honored Contributor III

In Databricks Runtime 7.2 and below, column-level statistics are stored in Delta Lake checkpoints as a JSON column.

In Databricks Runtime 7.3 LTS and above, column-level statistics are stored as a struct. The struct format makes Delta Lake reads much faster, because:

  • Delta Lake doesn’t perform expensive JSON parsing to obtain column-level statistics.
  • Parquet column pruning capabilities significantly reduce the I/O required to read the statistics for a column.

The struct format enables a collection of optimizations that reduce the overhead of Delta Lake read operations from seconds to tens of milliseconds, which significantly reduces the latency for short queries.
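If you want to see the two formats side by side, a hedged sketch: read a checkpoint Parquet file directly and inspect its schema. The path is hypothetical. In the checkpoint, JSON statistics appear as the string column `add.stats`, while the struct form appears as `add.stats_parsed`; the table properties `delta.checkpoint.writeStatsAsJson` and `delta.checkpoint.writeStatsAsStruct` control which of the two is written.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical checkpoint file; pick a real one from your table's _delta_log.
checkpoint = spark.read.parquet(
    "/tmp/my_delta_table/_delta_log/00000000000000000010.checkpoint.parquet"
)

# JSON stats: add.stats is a plain string that must be parsed per file.
# Struct stats: add.stats_parsed is a nested struct, so Parquet column
# pruning can read, say, only minValues for a single column.
checkpoint.printSchema()
```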

Anand_Ladda
Honored Contributor II

Great points above on how checkpointing helps with performance. In addition, Delta Lake also provides other data organization strategies, such as compaction and Z-ordering, to help with both the read and write performance of Delta tables. Additional details here: https://docs.databricks.com/delta/optimizations/file-mgmt.html
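For completeness, a short sketch of those two strategies, assuming a hypothetical Delta table named `events` with an `eventType` column (both are documented Delta Lake SQL commands on Databricks):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compaction: coalesce many small files into fewer, larger ones,
# reducing per-file overhead on reads.
spark.sql("OPTIMIZE events")

# Z-ordering: co-locate related values in the same files so that the
# column-level statistics in the checkpoint skip more files at query time.
spark.sql("OPTIMIZE events ZORDER BY (eventType)")
```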
