Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Checkpoint Location Error

AanchalSoni
Contributor
 

Hi!

I'm facing an error related to the checkpoint whenever I try to display a DataFrame using Auto Loader in the Databricks Free Edition (please refer to the screenshot). To work around it, I have to delete the checkpoint folder and then re-execute the display or writeStream command. Can someone help me understand the root cause and how I can overcome this?

 

 

 


4 REPLIES

Ashwin_DSA
Databricks Employee

Hi @AanchalSoni,

I can't see the full history of your notebook, so I'm not sure of the exact cause. But the behaviour strongly suggests that an earlier version of the stream used complete mode against the same checkpointLocation, and that configuration is what's causing the error now.

Your current call is display(accounts_df, output_mode="append", checkpointLocation=".../Checkpoint/")

The error, however, says Invalid streaming output mode: complete. This output mode is not supported for no streaming aggregations...

For a non-aggregated stream in append mode, Spark wouldn't complain about complete unless it was reading that mode from somewhere else. In Structured Streaming, the only source of this is the checkpoint metadata. The checkpoint stores the original query plan, including the output mode. When you reuse the same checkpoint path with a changed query (no aggregation + append), Spark detects a mismatch between the stored configuration (complete) and the new query, and throws STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION. When you delete the checkpoint, you erase that metadata and the stream starts clean, which is why deleting it fixes the issue.

The recommendation is not to reuse a checkpoint path across different query shapes or output modes. Give each logical stream (and each output mode) its own checkpointLocation.
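To picture the mechanism, here is a simplified sketch in plain Python (this is an illustration of the idea, not Spark's actual implementation): the checkpoint can be thought of as metadata that pins the output mode at first start, and a restart with a mismatched mode is rejected.

```python
import json
from pathlib import Path


def start_stream(checkpoint_dir: str, output_mode: str) -> str:
    """Illustrative sketch only (not Spark's real code): on first start,
    persist the query's output mode in the checkpoint metadata; on any
    later start, refuse to run if the new query uses a different mode."""
    meta = Path(checkpoint_dir) / "metadata.json"
    if meta.exists():
        stored = json.loads(meta.read_text())
        if stored["outputMode"] != output_mode:
            # Roughly what surfaces as STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION
            raise ValueError(
                f"Invalid streaming output mode: checkpoint stores "
                f"{stored['outputMode']!r}, query now uses {output_mode!r}"
            )
    else:
        meta.parent.mkdir(parents=True, exist_ok=True)
        meta.write_text(json.dumps({"outputMode": output_mode}))
    return f"stream started in {output_mode} mode"
```

Deleting the checkpoint folder removes `metadata.json` in this sketch, which is exactly why the next run starts clean.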

Can you confirm whether you have any other processing steps before the cell shown in the snapshot?

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

Hi Ashwin!

Thanks for your response. Before the screenshot step, I'm just reading the file with an explicit schema.

When you say 'The recommendation is not to reuse a checkpoint path across different query shapes or output modes. Give each logical stream (and each output mode) its own checkpointLocation.', does this mean that even if I'm only validating the output, I should create a new checkpoint location? Wouldn't that be overhead while working on multiple transformations?

Please bear with me; my questions come from a beginner's background and I'm trying my best to understand these showstoppers.

Hi @AanchalSoni,

No problem asking questions. That's what this forum is for.

You don't need a brand-new checkpoint for every tiny code change, but you should treat a checkpoint as belonging to one specific logical stream configuration.

A more precise rule of thumb: it is safe to reuse the same checkpointLocation when the query is logically the same (same input, same stateful operators such as aggregations, joins, and dedup, same output mode, keys, and watermarks), or when you are simply restarting the cluster or rerunning the same notebook cell.

Use a new checkpointLocation (or delete the old one) when you change the output mode (append vs. complete/update), add or remove stateful operations (aggregations, stream-stream joins, mapGroupsWithState, dedup with a watermark), or otherwise significantly alter the query shape in a way that affects state.
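That rule of thumb can be made concrete with a small comparison helper. This is purely illustrative; the function and field names below are assumptions for the sketch, not a Spark or Databricks API:

```python
def needs_new_checkpoint(old: dict, new: dict) -> bool:
    """Illustrative only: compare the aspects of a streaming query that a
    checkpoint effectively pins down. If any of them differ, the query is a
    new logical stream and deserves its own checkpointLocation."""
    pinned = ("source", "output_mode", "stateful_ops", "watermark")
    return any(old.get(k) != new.get(k) for k in pinned)
```

For example, switching the same source from complete to append mode changes a pinned aspect, so the helper reports that a fresh checkpoint is needed; rerunning an identical configuration does not.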

In your specific use case ("I'm just validating transformations, trying out different versions"):

  • Yes, that usually does mean using a different checkpoint path per meaningfully different version of the stream, instead of reusing the same one and fighting mysterious errors.
  • The overhead is mostly a few extra small directories and files in storage. For exploratory work, that cost is negligible compared to the cost of your time and cluster.

A practical pattern to keep it manageable:

/Volumes/.../checkpoints/accounts/        # base
  display_v1/
  display_v2_with_agg/
  write_to_delta_append/
  write_to_delta_complete/
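A layout like the one above can be generated with a tiny helper so every exploratory run gets a predictable, non-colliding path. The helper name and signature here are hypothetical, just to show the pattern:

```python
def checkpoint_path(base: str, stream_name: str, version: str) -> str:
    """Hypothetical helper: build one checkpoint directory per logical
    stream version, so exploratory reruns never collide."""
    return f"{base.rstrip('/')}/{stream_name}/{version}"
```

You would then pass, e.g., checkpoint_path(base, "accounts", "display_v2_with_agg") as the checkpointLocation option of that specific stream version.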

In summary: you don't need a new checkpoint for every single tweak, but when you change the type of stream (e.g., add an aggregation or switch output mode), treat it as a new logical stream and give it its own checkpointLocation. That's what avoids the error you're seeing and the need to keep deleting the folder.
 
Does this help?

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

Thanks Ashwin! And yes, your explanation about checkpoints was clear. I now understand the relevance of checkpoints.