Delta lake Check points storage concept

User16826994223 — Tue, 22 Jun 2021 11:45:08 GMT

In which format are checkpoints stored, and how do they help Delta Lake improve performance?

Re: Delta lake Check points storage concept

User16826994223 — Tue, 22 Jun 2021 11:45:39 GMT

Delta Lake writes a checkpoint, an aggregate snapshot of a Delta table's state, every 10 commits by default. Checkpoints are stored as Parquet files in the table's _delta_log directory, next to the JSON commit files. Each checkpoint serves as the starting point for computing the latest state of the table: a reader loads the most recent checkpoint and then applies only the commits written after it. Without checkpoints, Delta Lake would have to read a large collection of JSON files (“delta” files) representing commits to the transaction log to compute the state of a table. In addition, the column-level statistics Delta Lake uses to perform data skipping are stored in the checkpoint.
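To make the "starting point" idea concrete, here is a minimal sketch in plain Python that picks the latest checkpoint from a listing of _delta_log file names and returns only the JSON commits that still need to be replayed. It assumes single-part checkpoints named `<version>.checkpoint.parquet` and commits named `<version>.json` with 20-digit zero-padded versions. The helper names `files_to_read` and `demo_log` are made up for illustration, and this is not how Delta Lake itself is implemented.

```python
import re

# File-name patterns in a table's _delta_log directory.
# Assumption: single-part checkpoints only; multi-part checkpoints are ignored.
COMMIT_RE = re.compile(r"^(\d{20})\.json$")
CHECKPOINT_RE = re.compile(r"^(\d{20})\.checkpoint\.parquet$")


def files_to_read(log_files):
    """Return (latest checkpoint file or None, commit files to replay after it)."""
    commits = {}
    checkpoints = {}
    for name in log_files:
        m = COMMIT_RE.match(name)
        if m:
            commits[int(m.group(1))] = name
            continue
        m = CHECKPOINT_RE.match(name)
        if m:
            checkpoints[int(m.group(1))] = name

    if checkpoints:
        start = max(checkpoints)          # newest checkpoint version
        checkpoint = checkpoints[start]
    else:
        start = -1                        # no checkpoint: replay from version 0
        checkpoint = None

    # Only commits newer than the checkpoint have to be read.
    replay = [commits[v] for v in sorted(commits) if v > start]
    return checkpoint, replay


def demo_log(num_commits, checkpoint_interval=10):
    """Build a hypothetical log listing with a checkpoint every N commits."""
    names = [f"{v:020d}.json" for v in range(num_commits)]
    names += [
        f"{v:020d}.checkpoint.parquet"
        for v in range(checkpoint_interval, num_commits, checkpoint_interval)
    ]
    return names


if __name__ == "__main__":
    checkpoint, replay = files_to_read(demo_log(24))
    print(checkpoint)      # the version-20 checkpoint
    print(len(replay))     # commits 21, 22 and 23
```

With 24 commits in the log, the reader in this sketch loads one Parquet checkpoint plus three JSON commits instead of all 24 JSON files.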

Re: Delta lake Check points storage concept

User16826994223 — Tue, 22 Jun 2021 11:46:04 GMT

In Databricks Runtime 7.2 and below, column-level statistics are stored in Delta Lake checkpoints as a JSON column.

In Databricks Runtime 7.3 LTS and above, column-level statistics are stored as a struct. The struct format makes Delta Lake reads much faster, because:

Delta Lake doesn’t perform expensive JSON parsing to obtain column-level statistics.
Parquet column pruning capabilities significantly reduce the I/O required to read the statistics for a column.

The struct format enables a collection of optimizations that reduce the overhead of Delta Lake read operations from seconds to tens of milliseconds, which significantly reduces the latency for short queries.
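As a rough illustration of why the statistics matter and where the JSON cost comes from, the sketch below does min/max data skipping in plain Python. The statistics use the `numRecords` / `minValues` / `maxValues` / `nullCount` shape of Delta Lake's per-file stats, but the file names, values, and the helpers `parse_stats` and `files_matching_equals` are hypothetical. The `json.loads` step stands in for the parsing that the JSON column requires for every file; with the struct format the fields are already typed columns, so that step (and reading statistics for columns the query does not filter on) can be avoided.

```python
import json

# Per-file statistics stored as JSON strings (the older checkpoint layout).
# File names and values are made up for this example.
FILES_JSON_STATS = {
    "part-0000.parquet": '{"numRecords": 100, "minValues": {"id": 1}, '
                         '"maxValues": {"id": 100}, "nullCount": {"id": 0}}',
    "part-0001.parquet": '{"numRecords": 100, "minValues": {"id": 101}, '
                         '"maxValues": {"id": 200}, "nullCount": {"id": 0}}',
    "part-0002.parquet": '{"numRecords": 100, "minValues": {"id": 201}, '
                         '"maxValues": {"id": 300}, "nullCount": {"id": 0}}',
}


def parse_stats(files_json_stats):
    """Parse every file's JSON stats string into a nested dict."""
    return {path: json.loads(raw) for path, raw in files_json_stats.items()}


def files_matching_equals(parsed_stats, column, value):
    """Keep only files whose [min, max] range for `column` could contain `value`."""
    keep = []
    for path, stats in parsed_stats.items():
        lo = stats["minValues"].get(column)
        hi = stats["maxValues"].get(column)
        # Without statistics for the column the file cannot be skipped safely.
        if lo is None or hi is None or lo <= value <= hi:
            keep.append(path)
    return keep


if __name__ == "__main__":
    stats = parse_stats(FILES_JSON_STATS)
    # A query like `WHERE id = 150` only needs to open one of the three files.
    print(files_matching_equals(stats, "id", 150))
```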

Re: Delta lake Check points storage concept

aladda — Wed, 23 Jun 2021 04:14:58 GMT

Great points above on how checkpointing helps with performance. In addition, Delta Lake provides other data organization strategies, such as compaction and Z-ordering, that help with both read and write performance of Delta tables. Additional details here - https://docs.databricks.com/delta/optimizations/file-mgmt.html