
WAL for structured streaming

Brad
Contributor

Hi, 

I cannot find a deep dive on this in the latest docs. My understanding so far is:

Previously, SS (Structured Streaming) copied and cached the data in the WAL. After a certain version, with retrieve less, SS no longer copies the data to the WAL and only stores the offsets, so the WAL is no longer used and recovery depends only on the checkpoint. Is this understanding right?

2 REPLIES

Kaniz_Fatma
Community Manager

Your understanding is partially correct. Let's delve into the details of Structured Streaming in Apache Spark.

  1. Write-Ahead Log (WAL):

    • In the past, Structured Streaming copied and cached data in the Write-Ahead Log (WAL).
    • The WAL served as a reliable storage mechanism for offsets and metadata.
    • However, this approach had some limitations, including performance overhead and storage requirements.
  2. Changes with Retrieve Less:

    • After a certain version (specifically, with the introduction of "retrieve less"), Structured Streaming's behavior evolved.
    • Now, Structured Streaming no longer copies the entire data to the WAL.
    • Instead, it only stores the offsets (information about the position in the data stream) in the WAL.
    • The WAL is no longer used for storing the actual data.
  3. Checkpointing:

    • Structured Streaming relies heavily on checkpointing for fault tolerance and state preservation.
    • When a streaming query is started from scratch or resumed from a checkpoint, it uses the checkpoint location.
    • The checkpoint location contains information about the latest processed offsets.
    • These offsets are read from the offset WAL, which is itself stored under the checkpoint location.
    • The checkpoint ensures that the state is preserved even in the event of a failure.
  4. Micro-Batch Stream Processing:

    • Structured Streaming processes data in micro-batches.
    • When a micro-batch execution starts, it populates the available offsets registry from the latest entry in the offset WAL.
    • If a checkpoint is available, it retrieves offsets from there.
    • If that latest batch was also recorded as committed, the available offsets are added to the committed offsets.
    • The WAL is still involved in maintaining these offsets.
  5. Summary:

    • In summary, Structured Streaming has moved away from copying data to the WAL.
    • Instead, it focuses on storing and managing offsets efficiently.
    • Checkpointing plays a crucial role in ensuring fault tolerance and state consistency.
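
To make the recovery idea above concrete, here is a minimal toy model in Python. It is only a sketch, not Spark's implementation: `ToyCheckpoint`, `run_batch`, and `recover` are hypothetical names, and the in-memory dict and set stand in for the offset log and commit log that a real query keeps under its checkpoint location. The assumption is that the planned end offset is written before a batch is processed and the batch is marked committed only after it finishes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class ToyCheckpoint:
    """Hypothetical in-memory stand-in for a checkpoint location."""
    offset_log: Dict[int, int] = field(default_factory=dict)  # batch id -> planned end offset
    commit_log: Set[int] = field(default_factory=set)         # batch ids that completed


def run_batch(cp: ToyCheckpoint, batch_id: int, end_offset: int,
              fail_before_commit: bool = False) -> None:
    # Step 1 (write-ahead): record the planned end offset before touching any data.
    cp.offset_log[batch_id] = end_offset
    # Step 2: process the data for this batch (omitted in this sketch).
    if fail_before_commit:
        return  # simulate a crash after the offsets were logged
    # Step 3: mark the batch as completed.
    cp.commit_log.add(batch_id)


def recover(cp: ToyCheckpoint) -> Tuple[int, Optional[int]]:
    """Return (batch id to run next, end offset to replay or None)."""
    if not cp.offset_log:
        return 0, None  # nothing logged yet: start from scratch
    latest = max(cp.offset_log)
    if latest in cp.commit_log:
        return latest + 1, None  # last logged batch completed: plan a new one
    # Last logged batch never committed: re-run it with the same end offset.
    return latest, cp.offset_log[latest]


cp = ToyCheckpoint()
run_batch(cp, 0, end_offset=100)
run_batch(cp, 1, end_offset=250, fail_before_commit=True)
print(recover(cp))  # (1, 250)
```

Note that only offsets are logged here, never the data itself. Replaying batch 1 works only if the source can serve the same offset range again.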

For more in-depth information, you can refer to the Internals of Spark Structured Streaming documentation. Keep exploring, and happy streaming! 🚀

 

Thanks Kaniz. 

Theoretically, even without the WAL, everything can be recovered from the checkpoint, right? Does the WAL exist only for performance reasons? E.g., within a micro-batch, Spark might run multiple smaller batches, and the WAL is used to record the state of each of those sub-batches?
