Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Can we commit offsets in Spark Structured Streaming in Databricks?

Nis
New Contributor II

We are storing offset details in the checkpoint location. We wanted to know whether there is a way to commit the offset back to Kafka once we consume a message.

8 REPLIES

raphaelblg
Databricks Employee
Accepted Solution

Hi @Nis,

Spark Structured Streaming tracks which offsets have been consumed internally, rather than relying on the Kafka consumer to do it. Spark manages the source offsets itself and writes them to the streaming query's checkpoint.

So the answer is no, you cannot commit a Kafka offset through a Spark Structured Streaming query.

Might be worth checking https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-i....
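
For illustration, here is a rough sketch of what that approach could look like: a StreamingQueryListener that mirrors each micro-batch's end offsets back to a Kafka consumer group, purely so external tooling can see progress (Spark itself keeps using the checkpoint as its source of truth). This is only a sketch, not an official feature; it assumes PySpark 3.4+ (which added the Python StreamingQueryListener) and the confluent-kafka package, and the broker address and group id are placeholders.

```python
import json

from confluent_kafka import Consumer, TopicPartition
from pyspark.sql.streaming import StreamingQueryListener


class OffsetCommitListener(StreamingQueryListener):
    """After each micro-batch, commit the batch's end offsets to a Kafka
    consumer group. Spark never reads these commits back; its source of
    truth remains the checkpoint."""

    def __init__(self, bootstrap_servers, group_id):
        self.consumer = Consumer(
            {"bootstrap.servers": bootstrap_servers, "group.id": group_id}
        )

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        for source in event.progress.sources:
            if "kafka" not in source.description.lower():
                continue
            # endOffset is a JSON string like {"my-topic": {"0": 42, "1": 17}}
            end_offsets = json.loads(source.endOffset)
            commits = [
                TopicPartition(topic, int(partition), offset)
                for topic, partitions in end_offsets.items()
                for partition, offset in partitions.items()
            ]
            self.consumer.commit(offsets=commits, asynchronous=False)

    def onQueryTerminated(self, event):
        self.consumer.close()


# Register on the ambient SparkSession (available as `spark` on Databricks).
spark.streams.addListener(OffsetCommitListener("broker:9092", "my-external-group"))
```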

 

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

ranged_coop
Valued Contributor II

Sorry for taking it off-topic, but this behaviour of Databricks storing the offsets on its own rather than depending on Kafka's offsets used to cause the storage to grow by a lot - I am talking some 2-3 DBR versions back. Is that still how it works, or is there a setting that needs to be enabled to fix it? Will it cause any issues with the history? (I don't have any data on this now; it has been a long time since I worked on such a use case.)

@ranged_coop Regarding your questions:

  • Is there any setting that needs to be enabled to fix this?
    There is no setting to change this behavior; it is a design decision, not an issue. It sounds like you're referring to checkpointing. These are the docs: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-fai...
  • Will it cause any issues with the history?
    Spark Structured Streaming provides exactly-once processing guarantees. How the data is processed depends on the logic implemented in your state management.

Structured Streaming Programming Guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streamin...
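
For a concrete picture of what that checkpointing looks like in practice, here is a minimal sketch (the broker, topic, path, and table names are all placeholders): the Kafka source offsets, along with any state, are persisted under checkpointLocation, and a restarted query resumes from that point.

```python
# Minimal sketch: Spark persists the Kafka source offsets (and any state)
# under checkpointLocation; restarting the query resumes from that point.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "my-topic")                   # placeholder topic
    .load()
)

query = (
    stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/my-query")  # offsets live here
    .toTable("events")  # hypothetical target table
)
```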

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

@ranged_coop In addition to my previous message: checkpointing is not a Databricks behavior as you said; checkpointing is part of open-source Spark Structured Streaming.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

dmytro
New Contributor III

Hi @raphaelblg, thanks a lot for providing an elaborate answer. Do you happen to know, by any chance, of any solutions that developers use to track consumer lag when streaming with Spark from a Kafka topic? It's rather essential knowledge to have, in order to know whether more Spark workers or resources are needed, etc.

Thanks in advance!

raphaelblg
Databricks Employee

@dmytro yes, it's possible to monitor consumer lag through the streaming query metrics. Every cluster that runs a Spark Structured Streaming query will log the metrics for each streaming batch in the driver logs and the Spark UI. More details at Monitoring Structured Streaming queries on Databricks.
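
As a rough illustration (a sketch, not an official recipe): for the Kafka source, each batch's progress reports both the offsets Spark has processed (endOffset) and the newest offsets available on the brokers (latestOffset); the per-partition difference is a reasonable lag estimate. Here, query is assumed to be an active StreamingQuery.

```python
import json


def offsets_as_dict(value):
    # Offsets may surface as a JSON string or an already-parsed dict,
    # depending on how the progress object was obtained.
    return json.loads(value) if isinstance(value, str) else value


progress = query.lastProgress  # most recent micro-batch progress (a dict)
for source in progress["sources"]:
    end = offsets_as_dict(source["endOffset"])        # processed by Spark
    latest = offsets_as_dict(source["latestOffset"])  # available on the brokers
    for topic, partitions in latest.items():
        for partition, latest_offset in partitions.items():
            lag = latest_offset - end[topic][partition]
            print(f"{topic}[{partition}] lag={lag}")
```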

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

dmytro
New Contributor III

Thanks Raphael! That's helpful. I'll look into the links.

If I could ask you one more question: do you have any references or links on how upscaling and downscaling of the number of workers and cluster size is done for Structured Streaming? I have a use case where the amount of data varies drastically at times, and I wanted to use the consumer lag to build some scaling logic based on it.

raphaelblg
Databricks Employee

@dmytro

Autoscaling is managed by Databricks, and its logic is mostly automatic. But if you're planning to run Structured Streaming in production, I suggest you either go with a fixed number of workers and limit your streaming query's input rate (see the sketch below), or create a DLT pipeline that uses Enhanced Autoscaling.

This doc covers the production considerations for Structured Streaming workloads: https://docs.databricks.com/en/structured-streaming/production.html.

As mentioned in the docs above, when working with compute auto-scaling, the auto-scaling algorithm has some difficulty scaling down for Structured Streaming workloads:

 

Compute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling.

Compute auto-scaling docs: https://docs.databricks.com/en/compute/configure.html#benefits-of-autoscaling
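
To make the fixed-size-cluster suggestion concrete, here's a hedged sketch of limiting the input rate with the Kafka source's maxOffsetsPerTrigger option, which caps how many offsets each micro-batch consumes (the broker, topic, and cap value are placeholders):

```python
# Sketch: cap each micro-batch so batch sizes stay predictable on a
# fixed number of workers.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "my-topic")
    .option("maxOffsetsPerTrigger", "100000")  # max offsets consumed per batch
    .load()
)
```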


Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
