Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

WAL for structured streaming

Brad
Contributor

Hi, 

I cannot find a deep dive on this in the latest links. My understanding so far is:

Previously, SS (Structured Streaming) copied and cached the data in the WAL. After some version, with retrieve less, SS no longer copies the data to the WAL and stores only the offsets; the WAL is no longer used, and SS depends only on the checkpoint. Is this understanding right?

2 REPLIES

Kaniz
Community Manager

Your understanding is partially correct. Let’s delve into the details of Structured Streaming in Apache Spark.

  1. Write-Ahead Log (WAL):

    • Copying and caching received data in a Write-Ahead Log (WAL) is a feature of the older Spark Streaming (DStreams) API with receiver-based sources, rather than of Structured Streaming.
    • In Structured Streaming, the WAL is the offset log: reliable storage for offsets and batch metadata.
    • Logging the data itself has limitations, including performance overhead and extra storage requirements.
  2. Changes with Retrieve Less:

    • Regardless of the “retrieve less” change you mention, Structured Streaming does not copy the incoming data to the WAL.
    • Instead, before each micro-batch runs, it stores only the offsets (information about the position in the data stream) in the WAL.
    • The actual data stays in the source and is read again from those offsets if a batch has to be replayed, which is why sources need to be replayable.
  3. Checkpointing:

    • Structured Streaming relies heavily on checkpointing for fault tolerance and state preservation.
    • A streaming query started from scratch creates its checkpoint location; a resumed query reads its progress from it.
    • The checkpoint location contains the offset log (the WAL), a commit log of completed batches, and the state of stateful operators.
    • The latest processed offsets are retrieved from the WAL inside the checkpoint location, so the WAL is part of the checkpoint rather than a separate mechanism.
    • The checkpoint ensures that the state is preserved even in the event of a failure.
  4. Micro-Batch Stream Processing:

    • Structured Streaming processes data in micro-batches.
    • When micro-batch execution starts, it populates the available offsets registry from the latest entry in the WAL, if the checkpoint contains one.
    • If the commit log shows that this batch already completed, the available offsets are added to the committed offsets and the next batch starts; otherwise the batch is run again from the logged offsets.
    • The WAL is therefore still involved in maintaining these offsets.
  5. Summary:

    • In summary, Structured Streaming does not copy data to the WAL.
    • Instead, it stores and manages offsets in the WAL, which lives inside the checkpoint location.
    • Checkpointing plays a crucial role in ensuring fault tolerance and state consistency.
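
As a rough illustration of how the offsets kept in the checkpoint drive recovery, here is a minimal sketch in plain Python (no Spark required). It assumes the open-source Spark checkpoint layout, in which the checkpoint location has an `offsets` subdirectory (the WAL) and a `commits` subdirectory, each holding one file per batch ID. The function names `latest_batch_id` and `plan_restart` are hypothetical helpers for this example, not Spark APIs, and the logic is a simplification that ignores sources and operator state.

```python
from pathlib import Path
from typing import Optional, Tuple


def latest_batch_id(log_dir: Path) -> Optional[int]:
    """Return the highest batch ID found in a log directory, or None if there is none."""
    if not log_dir.is_dir():
        return None
    # Assumption: each log entry is a file named after its batch ID ("0", "1", ...).
    # Anything else (for example temporary files) is ignored.
    ids = [
        int(p.name)
        for p in log_dir.iterdir()
        if p.is_file() and p.name.isascii() and p.name.isdigit()
    ]
    return max(ids, default=None)


def plan_restart(checkpoint_dir: str) -> Tuple[str, int]:
    """Decide which micro-batch a restarted query would run (simplified sketch)."""
    root = Path(checkpoint_dir)
    planned = latest_batch_id(root / "offsets")    # last batch whose offsets were logged
    committed = latest_batch_id(root / "commits")  # last batch recorded as completed

    if planned is None:
        # Nothing was logged yet: the query starts from batch 0.
        return ("start_fresh", 0)
    if committed == planned:
        # The last planned batch finished: continue with the next one.
        return ("next_batch", planned + 1)
    # Offsets were logged but the batch was never committed: replay it.
    return ("rerun_batch", planned)
```

For a checkpoint whose `offsets` directory holds files `0`, `1`, `2` and whose `commits` directory holds only `0` and `1`, `plan_restart` returns `("rerun_batch", 2)`: batch 2 was planned but not completed, so it is replayed from the logged offsets.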

For more in-depth information, you can refer to The Internals of Spark Structured Streaming documentation. Keep exploring, and happy streaming! 🚀

 

Thanks Kaniz. 

Theoretically, even without the WAL, everything can be recovered from the checkpoint, right? Does the WAL exist only for performance reasons? E.g., within a micro-batch, Spark might run multiple smaller batches, and the WAL is used to record the state of each micro micro-batch?
