Databricks Community

Brad · ‎03-14-2024

Hi,

I cannot find a deep dive on this in the latest links. So far my understanding is:

Previously, SS (Structured Streaming) copied and cached the data in the WAL. After some version, with retrieve less, SS no longer copies the data to the WAL and only stores the "offset"; the WAL is no longer used, and recovery depends only on the checkpoint. Is this understanding right?

Kaniz · ‎03-15-2024

Your understanding is partially correct. Let’s delve into the details of Structured Streaming in Apache Spark.

Write-Ahead Log (WAL):
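- In the past, the older receiver-based Spark Streaming (DStreams) API could copy received data into the Write-Ahead Log (WAL); as far as I know, Structured Streaming itself has not stored the data there.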
- The WAL served as a reliable storage mechanism for offsets and metadata.
- However, this approach had some limitations, including performance overhead and storage requirements.
Changes with Retrieve Less:
- After a certain version (specifically, with the introduction of “retrieve less”), Structured Streaming behavior evolved.
- Structured Streaming does not copy the data itself to the WAL.
- Instead, it only stores the offsets (information about the position in the data stream) in the WAL.
- The WAL is no longer used for storing the actual data.
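
To make "offsets only" concrete, here is a minimal sketch of an offset-only write-ahead log. It is a toy model, not Spark's implementation or on-disk format: the `OffsetLog` class, its method names, and the `events-topic` source are all hypothetical. The point it illustrates is that each batch records only how far to read in each source, never the rows themselves.

```python
import json
import os
import tempfile


class OffsetLog:
    """Toy write-ahead log: one small JSON file per batch, holding offsets only."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def add(self, batch_id, offsets):
        """Record the offsets a batch will read up to, before the batch runs."""
        path = os.path.join(self.directory, str(batch_id))
        if os.path.exists(path):
            return False  # batch already planned; never overwrite an entry
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(offsets, f)
        os.replace(tmp_path, path)  # rename so readers never see a partial file
        return True

    def get(self, batch_id):
        """Read back the offsets recorded for one batch."""
        with open(os.path.join(self.directory, str(batch_id))) as f:
            return json.load(f)

    def latest_batch_id(self):
        """Highest batch id that has an entry, or None if the log is empty."""
        ids = [int(name) for name in os.listdir(self.directory) if name.isdigit()]
        return max(ids) if ids else None


# Hypothetical source with two partitions; only positions are logged, not data.
log = OffsetLog(os.path.join(tempfile.mkdtemp(), "offsets"))
log.add(0, {"events-topic": {"0": 100, "1": 80}})
log.add(1, {"events-topic": {"0": 250, "1": 190}})
```

Because each entry is only a few bytes of positions, replaying a batch means re-reading that offset range from a replayable source, which is why the source has to support re-reading for this approach to work.
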
Checkpointing:
- Structured Streaming relies heavily on checkpointing for fault tolerance and state preservation.
- When a streaming query is started from scratch or resumed from a checkpoint, it uses the checkpoint location.
- The checkpoint location contains information about the latest processed offsets.
- These offsets are read from the WAL (the offset log), which is itself stored under the checkpoint location.
- The checkpoint ensures that the state is preserved even in the event of a failure.
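
If you want to see this on disk, a small sketch like the one below can help. It assumes the checkpoint is on a local filesystem path and uses the `offsets` and `commits` subdirectories that open-source Spark writes under the checkpoint location, with one file per batch id. The helper names `latest_batch_id` and `describe_checkpoint` are made up for illustration, and the demo runs against a fake directory rather than a real checkpoint.

```python
import os
import tempfile


def latest_batch_id(checkpoint_dir, log_name):
    """Return the highest numeric file name under <checkpoint_dir>/<log_name>, or None."""
    log_dir = os.path.join(checkpoint_dir, log_name)
    if not os.path.isdir(log_dir):
        return None
    # Skip anything that is not a plain batch id (temp files, checksum files, ...).
    ids = [int(name) for name in os.listdir(log_dir) if name.isdigit()]
    return max(ids) if ids else None


def describe_checkpoint(checkpoint_dir):
    """Summarize the last planned batch versus the last committed batch."""
    planned = latest_batch_id(checkpoint_dir, "offsets")
    committed = latest_batch_id(checkpoint_dir, "commits")
    return {
        "last_planned": planned,
        "last_committed": committed,
        "in_flight": planned is not None and planned != committed,
    }


# Demo on a fake checkpoint: batches 0..2 were planned, only 0..1 were committed.
demo_dir = tempfile.mkdtemp()
for log_name, batch_ids in (("offsets", [0, 1, 2]), ("commits", [0, 1])):
    os.makedirs(os.path.join(demo_dir, log_name))
    for batch_id in batch_ids:
        open(os.path.join(demo_dir, log_name, str(batch_id)), "w").close()

summary = describe_checkpoint(demo_dir)
```

For a checkpoint on DBFS or cloud storage you would need to swap `os.listdir` for the matching listing call; the comparison itself stays the same.
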
Micro-Batch Stream Processing:
- Structured Streaming processes data in micro-batches.
- When a micro-batch execution starts, it populates the available offsets registry from the WAL.
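- If the query restarts from an existing checkpoint, the offsets of the latest planned batch are read from the WAL stored there.
- The available offsets are added to the committed offsets once the commit log shows that the batch finished; otherwise the batch is re-run over the same offsets.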
- The WAL is still involved in maintaining these offsets.
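
The restart logic described above can be sketched roughly as follows. This is a simplified model for illustration, not Spark's actual code: `RestartPlan`, `plan_restart`, and the in-memory dict and set standing in for the offset log and commit log are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Set

Offsets = Dict[str, int]  # source name -> position reached in that source


@dataclass
class RestartPlan:
    next_batch_id: int
    committed_offsets: Offsets   # data known to be fully processed
    available_offsets: Offsets   # data the next run may read up to
    rerun_last_batch: bool


def plan_restart(offset_log: Dict[int, Offsets], commit_log: Set[int]) -> RestartPlan:
    """Decide where a restarted query should resume."""
    if not offset_log:
        # Fresh start: nothing planned, nothing committed.
        return RestartPlan(0, {}, {}, False)

    latest = max(offset_log)
    available = dict(offset_log[latest])

    if latest in commit_log:
        # The latest planned batch finished: its offsets count as committed
        # and the query moves on to a new batch.
        return RestartPlan(latest + 1, dict(available), available, False)

    # The latest batch was planned but never committed: run it again over
    # the same offset range, starting from the previous batch's end.
    committed = dict(offset_log.get(latest - 1, {}))
    return RestartPlan(latest, committed, available, True)


# Hypothetical logs: batch 1 was written to the offset log but never committed.
offset_log = {0: {"kafka": 10}, 1: {"kafka": 25}}
plan = plan_restart(offset_log, commit_log={0})
```

In this example `plan` asks for batch 1 to be re-run over positions 10 to 25, which is the reason the offsets are written before the batch executes.
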
Summary:
- In summary, Structured Streaming has moved away from copying data to the WAL.
- Instead, it focuses on storing and managing offsets efficiently.
- Checkpointing plays a crucial role in ensuring fault tolerance and state consistency.

For more in-depth information, you can refer to the Internals of Spark Structured Streaming documentation. Keep exploring, and happy streaming! 🚀

Brad · ‎03-15-2024

Thanks Kaniz.

Theoretically, even without the WAL, everything can be recovered from the checkpoint, right? Does the WAL exist only for performance reasons? E.g., within a micro-batch, Spark might run multiple smaller batches, and the WAL is used to record the state of each of those smaller batches?

WAL for structured streaming
