
WAL for structured streaming

Brad
Contributor

Hi, 

I cannot find a deep dive on this in the latest links. My understanding so far is:

Previously, SS (Structured Streaming) copied and cached the data in the WAL. After a certain version, with "retrieve less", SS no longer copies the data to the WAL and only stores the "offset", so the WAL is no longer used and recovery depends only on the checkpoint. Is this understanding right?

2 REPLIES

Kaniz
Community Manager

Your understanding is partially correct. Let’s delve into the details of Structured Streaming in Apache Spark.

  1. Write-Ahead Log (WAL):

    • In the past, Structured Streaming used to copy and cache data in the Write-Ahead Log (WAL).
    • The WAL served as a reliable storage mechanism for offsets and metadata.
    • However, this approach had some limitations, including performance overhead and storage requirements.
  2. Changes with Retrieve Less:

    • After a certain version (specifically, with the introduction of “retrieve less”), Structured Streaming behavior evolved.
    • Now, Structured Streaming no longer copies the entire data to the WAL.
    • Instead, it only stores the offsets (information about the position in the data stream) in the WAL.
    • The WAL is no longer used for storing the actual data.
  3. Checkpointing:

    • Structured Streaming relies heavily on checkpointing for fault tolerance and state preservation.
    • When a streaming query is started from scratch or resumed from a checkpoint, it uses the checkpoint location.
    • The checkpoint location contains information about the latest processed offsets.
    • These offsets are retrieved from the WAL.
    • The checkpoint ensures that the state is preserved even in the event of a failure.
  4. Micro-Batch Stream Processing:

    • Structured Streaming processes data in micro-batches.
    • When a micro-batch execution starts, it populates the available offsets registry from the WAL.
    • If a checkpoint is available, it retrieves offsets from there.
    • If the latest batch in the WAL has also been committed, its available offsets are added to the committed offsets.
    • The WAL is still involved in maintaining these offsets.
  5. Summary:

    • In summary, Structured Streaming has moved away from copying data to the WAL.
    • Instead, it focuses on storing and managing offsets efficiently.
    • Checkpointing plays a crucial role in ensuring fault tolerance and state consistency.
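
The restart behaviour described in points 3 and 4 can be sketched in plain Python. This is only a toy model, not Spark code: `CheckpointSketch`, `StartState` and `populate_start_offsets` are hypothetical names, and the two dictionaries/sets merely stand in for the offset log (the WAL) and the commit log that Spark keeps under the checkpoint location. The assumption is that a batch whose offsets were logged but never committed is re-run with the same offsets.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CheckpointSketch:
    """Toy stand-in for the logs kept in a checkpoint location (assumed shape)."""
    # batch id -> offsets per source, written BEFORE the batch runs (the WAL)
    offset_log: Dict[int, Dict[str, int]] = field(default_factory=dict)
    # batch ids whose processing finished, written AFTER the batch runs
    commit_log: Set[int] = field(default_factory=set)


@dataclass
class StartState:
    current_batch_id: int
    committed_offsets: Dict[str, int]
    available_offsets: Dict[str, int]
    rerun_latest_batch: bool


def populate_start_offsets(cp: CheckpointSketch) -> StartState:
    """Decide where a restarted query resumes, using only the two logs."""
    if not cp.offset_log:
        # Nothing logged yet: the query starts from scratch at batch 0.
        return StartState(0, {}, {}, False)

    latest = max(cp.offset_log)
    available = dict(cp.offset_log[latest])

    if latest in cp.commit_log:
        # Latest planned batch finished: its available offsets become the
        # committed offsets and the query moves on to the next batch id.
        return StartState(latest + 1, dict(available), available, False)

    # Latest batch was planned but not committed: re-run it with the same
    # offsets; committed offsets are those of the previous batch, if any.
    committed = dict(cp.offset_log.get(latest - 1, {}))
    return StartState(latest, committed, available, True)


if __name__ == "__main__":
    cp = CheckpointSketch()
    cp.offset_log[0] = {"kafka": 100}
    cp.commit_log.add(0)
    cp.offset_log[1] = {"kafka": 250}  # simulated crash before batch 1 committed
    print(populate_start_offsets(cp))
```

In this model the data itself never appears in either log; only the offset ranges do, which is why the source has to be replayable for the re-run to work.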

For more in-depth information, you can refer to the Internals of Spark Structured Streaming documentation. Keep exploring, and happy streaming! 🚀
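
To see the relationship between the WAL and the checkpoint on disk, a small helper like the one below can list the batch ids recorded in a checkpoint directory. It is a rough sketch that assumes the usual layout, in which the checkpoint location contains an `offsets` and a `commits` subdirectory with one file per batch, named by batch id. `batch_ids` and `uncommitted_batch` are hypothetical helper names, and the code only reads a local filesystem path (not DBFS or cloud storage).

```python
import re
from pathlib import Path
from typing import List, Optional


def batch_ids(log_dir: Path) -> List[int]:
    """Return the sorted batch ids found in a log directory.

    Only files whose names are plain integers are counted, so checksum
    files such as `.1.crc` and temporary files are ignored.
    """
    if not log_dir.is_dir():
        return []
    return sorted(
        int(p.name)
        for p in log_dir.iterdir()
        if p.is_file() and re.fullmatch(r"[0-9]+", p.name)
    )


def uncommitted_batch(checkpoint_dir: str) -> Optional[int]:
    """Return the latest batch id that has offsets logged but no commit entry."""
    root = Path(checkpoint_dir)
    planned = batch_ids(root / "offsets")
    committed = set(batch_ids(root / "commits"))
    if planned and planned[-1] not in committed:
        return planned[-1]
    return None


if __name__ == "__main__":
    # Hypothetical path; replace with a real local checkpoint location.
    print(uncommitted_batch("/tmp/my-query-checkpoint"))
```

A non-`None` result suggests the query stopped after planning a batch but before finishing it, which is the situation the offset log is meant to cover.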

 

Thanks Kaniz. 

Theoretically, even without the WAL, everything can be recovered from the checkpoint, right? Does the WAL exist only for performance reasons? E.g., within a micro-batch, Spark might run multiple smaller batches, and the WAL is used to record the state of each of these "micro micro-batches"?
