Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Can we commit offsets in Spark Structured Streaming in Databricks?

Nis
New Contributor II

We are storing offset details in the checkpoint location. Is there a way to commit the offset back to Kafka once we consume a message?

ACCEPTED SOLUTION

raphaelblg
Honored Contributor II

Hi @Nis , 

Spark Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. Spark tracks the source offsets and writes them to the streaming query's checkpoint.

So the answer is no: you cannot commit a Kafka offset through a Spark Structured Streaming query.

Might be worth checking https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-i....
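To make this concrete, here is a minimal PySpark sketch (the topic name, bootstrap servers, and paths are placeholders, not taken from this thread): progress tracking happens entirely through the query's checkpointLocation, and nothing is committed back to the Kafka consumer group.

# Minimal sketch (PySpark). Topic name, bootstrap servers and paths below are
# placeholders for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "my_topic")                    # placeholder
    .option("startingOffsets", "latest")
    .load()
)

# Spark records the consumed Kafka offsets in the query's checkpoint location;
# it does not commit them back to the Kafka consumer group.
query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/kafka_demo")  # offsets tracked here
    .start("/tmp/output/kafka_demo")                              # placeholder path
)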

 

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks


REPLIES

ranged_coop
Valued Contributor II

Sorry for taking this off-topic, but this behaviour of Databricks storing the offsets on its own and not depending on Kafka's offsets used to cause the storage to grow by a lot - I am talking about 2-3 DBR versions back. Is that still how it works, or is there a setting that needs to be enabled to fix it? Will it cause any issues with the history? (I don't have any data on this now; it's been a long time since I worked on such a use case.)

raphaelblg
Honored Contributor II

@ranged_coop Regarding your questions:

  • Is there any setting that needs to be enabled to fix this?
    There is no setting to change this behavior, as it is a design decision and not an issue. It looks like you're referring to checkpointing. These are the docs: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-fai... (a related retention setting is sketched at the end of this reply).
  • Will it cause any issues with the history?
    Spark Structured Streaming provides exactly-once processing guarantees. How you process the data depends on the logic implemented in your state management.

Structured Streaming Programming Guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streamin...
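As a hedged aside on the storage-growth concern: the offset-tracking behaviour itself cannot be switched off, but open-source Spark does expose a configuration, spark.sql.streaming.minBatchesToRetain, that bounds how many batch entries the streaming checkpoint metadata logs keep (the default is 100). A minimal sketch, assuming a standard Spark session:

# Hedged sketch: this conf limits how many batch entries the streaming
# checkpoint metadata logs retain (default 100). It does not change the
# offset-tracking design discussed above.
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "50")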

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

raphaelblg
Honored Contributor II

@ranged_coop In addition to my previous message: checkpointing is not a Databricks-specific behavior as you said; checkpointing is part of open-source Spark Structured Streaming.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

dmytro
New Contributor III

Hi @raphaelblg , thanks a lot for providing an elaborate answer. Do you happen to know, by any chance, of any solutions that developers use to track consumer lag when streaming with Spark from a Kafka topic? It's rather essential knowledge for deciding whether more Spark workers or more resources are needed, etc.

Thanks in advance!

raphaelblg
Honored Contributor II

@dmytro yes, it's possible to monitor the consumer lag through the streaming query metrics. Every cluster that runs a Spark Structured Streaming query logs the metrics for each streaming batch in the driver logs and the Spark UI. More details at Monitoring Structured Streaming queries on Databricks.
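One way to surface that lag programmatically is to read the query's last progress report. This is a sketch under the assumption that your Spark version reports latestOffset and endOffset for the Kafka source in the progress JSON (recent versions do); the helper name kafka_lag is made up for illustration.

# Hedged sketch: approximate Kafka consumer lag from the streaming query
# progress. Field names (endOffset / latestOffset) depend on the Spark
# version; adjust to what your progress JSON actually shows.
def kafka_lag(query):
    progress = query.lastProgress  # dict parsed from the progress JSON
    if not progress:
        return None
    lag = 0
    for source in progress["sources"]:
        latest = source.get("latestOffset") or {}
        end = source.get("endOffset") or {}
        for topic, partitions in latest.items():
            for partition, latest_off in partitions.items():
                end_off = end.get(topic, {}).get(partition, 0)
                lag += max(latest_off - end_off, 0)
    return lag

# Example: poll the running query from a notebook cell.
# print(kafka_lag(query))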

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

dmytro
New Contributor III

Thanks Raphael! That's helpful. I'll look into the links.

If I could ask you one more question: do you have any references or links on how upscaling and downscaling of the number of workers and the cluster size is done for Structured Streaming? I have a use case where the amount of data varies drastically at times, and I wanted to use the consumer lag to build some scaling logic based on it.

raphaelblg
Honored Contributor II

@dmytro

Autoscaling is managed by Databricks and its logic is mostly automatic. But if you're planning to run Structured Streaming in production, I suggest you go with a fixed number of workers and limit your streaming query's input rate, or create a DLT pipeline that uses Enhanced Autoscaling.

This doc covers the production considerations for structured streaming workloads: https://docs.databricks.com/en/structured-streaming/production.html.

As mentioned in the docs above, when working with compute auto-scaling, the auto-scaling algorithm has some difficulty scaling down for Structured Streaming workloads:

 

Compute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling.

Compute auto-scaling docs: https://docs.databricks.com/en/compute/configure.html#benefits-of-autoscaling
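On "limiting your streaming query input rate": for a Kafka source the usual knob is the maxOffsetsPerTrigger option, which caps how many records each micro-batch reads. A minimal sketch (topic and bootstrap servers are placeholders):

# Hedged sketch: bound the per-batch work of a Kafka stream so a fixed-size
# cluster keeps up predictably. Topic/server names are placeholders.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "my_topic")                    # placeholder
    .option("maxOffsetsPerTrigger", 10000)              # cap records per micro-batch
    .load()
)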


Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
