Why is the delta log checkpoint created in different formats?

Brad
Contributor

Hi,
I'm using runtime 15.4 LTS or 14.3 LTS. When loading a Delta Lake table from Kinesis, I found the delta log checkpoints are in mixed formats like:

7616 00000000000003291896.checkpoint.b1c24725-....json
7616 00000000000003291906.checkpoint.873e1b3e-....json
7616 00000000000003291916.checkpoint.e14e7613-....json
7616 00000000000003291926.checkpoint.3c9a0512-....json
7616 00000000000003291936.checkpoint.ba87e77a-....json
7653 00000000000003291936.checkpoint.parquet
7616 00000000000003291946.checkpoint.daf933a4-....json
7616 00000000000003291956.checkpoint.80768fb1-....json
7614 00000000000003291961.checkpoint.59ad2faf-....json
7614 00000000000003291971.checkpoint.ddb7a4f4-....json
7614 00000000000003291981.checkpoint.45867b1a-....json
7614 00000000000003291991.checkpoint.ec13fc70-....json

Why does it have classic and v2 checkpoints mixed together?

Thanks

 

1 REPLY

Panda
Contributor III
 
Your Delta log listing shows the two checkpoint formats defined by the Delta protocol side by side: classic checkpoints, which are single Parquet files named <version>.checkpoint.parquet, and V2 checkpoints, which use UUID-based names like <version>.checkpoint.<uuid>.json (or .parquet) and keep most of their content in sidecar files under _delta_log/_sidecars/. So in your listing, the UUID-named JSON files are V2 checkpoints and the plain .parquet file is a classic checkpoint. V2 checkpoints were introduced with the v2Checkpoint table feature in Delta Lake 3.0 to speed up checkpoint reads and writes on large tables. A table with the feature enabled can legitimately contain both formats, because writers may still produce a classic checkpoint alongside the V2 ones, which is exactly what you see at version 3291936.
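
If you want to see the split for yourself, you can group the _delta_log file names by pattern. A minimal PySpark sketch for a Databricks notebook (dbutils is available there by default; the table path below is a hypothetical placeholder, not from your setup):

import re

table_path = "s3://my-bucket/my-table"  # hypothetical; replace with your table's location

# Classic: <version>.checkpoint.parquet
# V2:      <version>.checkpoint.<uuid>.json or .parquet
classic = re.compile(r"^\d{20}\.checkpoint\.parquet$")
v2 = re.compile(r"^\d{20}\.checkpoint\.[0-9a-f-]{36}\.(json|parquet)$")

for f in dbutils.fs.ls(table_path + "/_delta_log"):
    if classic.match(f.name):
        print("classic:", f.name)
    elif v2.match(f.name):
        print("v2:", f.name)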
 
To have the V2 top-level checkpoint file written as Parquet instead of JSON, the relevant knob is this session configuration:
spark.conf.set("spark.databricks.delta.checkpointV2.topLevelFileFormat", "parquet")
(The delta.checkpoint.writeStatsAsStruct table property is unrelated to this choice; it only controls whether per-file statistics are stored as a struct column inside Parquet checkpoints.)
 
 
For compatibility with older jobs that only understand classic checkpoints, you can keep the table on the classic format via its checkpoint policy table property, delta.checkpointPolicy (valid values are 'classic' and 'v2'). Note that once the v2Checkpoint table feature is enabled, setting the policy back to classic changes what gets written, but fully removing the feature from the table protocol requires ALTER TABLE ... DROP FEATURE.
 
Cleanup:
No manual log compaction is needed here. Checkpoints are written automatically every 10 commits by default (tunable via the delta.checkpointInterval table property), and log entries older than delta.logRetentionDuration are removed when a new checkpoint is written. VACUUM and OPTIMIZE operate on the table's data files, not on the _delta_log directory.
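
If the sheer number of checkpoint files bothers you, the interval is tunable per table; a sketch with a hypothetical table name:

# Write a checkpoint every 100 commits instead of the default 10.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.my_table
    SET TBLPROPERTIES ('delta.checkpointInterval' = '100')
""")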
 
Recommendation:
On Databricks Runtime 15.4 LTS or 14.3 LTS, I recommend staying on V2 checkpoints (with Parquet top-level files) to benefit from faster log processing.
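
To double-check which checkpoint readers will actually pick up, the _delta_log/_last_checkpoint file records the most recent one. A minimal sketch (the path is an assumption, as above):

import json

# _last_checkpoint is a small JSON file pointing at the latest checkpoint.
raw = dbutils.fs.head("s3://my-bucket/my-table/_delta_log/_last_checkpoint")
print(json.dumps(json.loads(raw), indent=2))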
