<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: can we commit offset in spark structured streaming in databricks. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/71946#M34442</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/64195"&gt;@Nis&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Spark&amp;nbsp;&lt;SPAN&gt;Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. Spark manages the source offsets and writes them to the Spark streaming query checkpoint.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;So the answer is no, you cannot commit a Kafka offset through a spark structured streaming query.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Might be worth checking&amp;nbsp;&lt;A href="https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-in-spark-structured-stream" target="_blank"&gt;https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-in-spark-structured-stream&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 06 Jun 2024 19:20:22 GMT</pubDate>
    <dc:creator>raphaelblg</dc:creator>
    <dc:date>2024-06-06T19:20:22Z</dc:date>
    <item>
      <title>can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/71900#M34432</link>
      <description>&lt;P&gt;We are storing offset details in the checkpoint location and wanted to know whether there is a way to commit the offset once we consume the message from Kafka.&lt;/P&gt;</description>
      <pubDate>Thu, 06 Jun 2024 12:50:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/71900#M34432</guid>
      <dc:creator>Nis</dc:creator>
      <dc:date>2024-06-06T12:50:30Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/71946#M34442</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/64195"&gt;@Nis&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Spark&amp;nbsp;&lt;SPAN&gt;Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. Spark manages the source offsets and writes them to the Spark streaming query checkpoint.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;So the answer is no, you cannot commit a Kafka offset through a spark structured streaming query.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Might be worth checking&amp;nbsp;&lt;A href="https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-in-spark-structured-stream" target="_blank"&gt;https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-in-spark-structured-stream&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 06 Jun 2024 19:20:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/71946#M34442</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-06-06T19:20:22Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89002#M37668</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97998"&gt;@raphaelblg&lt;/a&gt;&amp;nbsp;, thanks a lot for providing an elaborate answer. Do you happen to know, by any chance, of some solutions that developers use to track consumer lag when streaming with Spark from a Kafka topic? It's rather essential knowledge to have in order to know if more Spark workers are needed, or more resources, etc.&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Sat, 07 Sep 2024 11:03:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89002#M37668</guid>
      <dc:creator>dmytro</dc:creator>
      <dc:date>2024-09-07T11:03:19Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89078#M37684</link>
      <description>&lt;P&gt;Sorry for taking it off-topic, but this behaviour of Databricks storing the offset on its own and not depending on Kafka's offset used to cause the storage to grow by a lot - I am talking some 2-3 DBR versions back - is that still how it is now, or is there any setting that needs to be enabled to fix this? Will it cause any issues with the history? (I do not have any data on this now; it has been a long time since I worked on such a use case)&lt;/P&gt;</description>
      <pubDate>Sun, 08 Sep 2024 15:33:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89078#M37684</guid>
      <dc:creator>ranged_coop</dc:creator>
      <dc:date>2024-09-08T15:33:43Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89200#M37725</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/47980"&gt;@ranged_coop&lt;/a&gt;&amp;nbsp;Regarding your questions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Is there any setting that needs to be enabled to fix this?&lt;BR /&gt;&lt;/STRONG&gt;There is no setting to change this behavior, as it is a design decision and not an issue. Looks like you're referring to checkpointing. These are the docs:&amp;nbsp;&lt;A href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing" target="_blank" rel="noopener"&gt;https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing&lt;/A&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Will it cause any issues with the history?&lt;BR /&gt;&lt;/STRONG&gt;&lt;SPAN&gt;Spark structured streaming provides exactly-once processing guarantees. How you process the data depends on the logic implemented in your state management.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Structured Streaming Programming Guide:&lt;/STRONG&gt; &lt;A href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streaming-programming-guide" target="_blank" rel="noopener"&gt;https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streaming-programming-guide&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 09 Sep 2024 15:44:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89200#M37725</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-09-09T15:44:08Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89201#M37726</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117735"&gt;@dmytro&lt;/a&gt;&amp;nbsp;yes, it's possible to monitor the consumer lag through the streaming query metrics. Every cluster that runs a spark structured streaming query will log the metrics for each streaming batch in the &lt;A href="https://docs.databricks.com/en/compute/clusters-manage.html#compute-driver-and-worker-logs" target="_blank"&gt;driver logs&lt;/A&gt; and &lt;A href="https://docs.databricks.com/en/compute/troubleshooting/debugging-spark-ui.html#debugging-with-the-apache-spark-ui" target="_blank"&gt;Spark UI&lt;/A&gt;. More details at&amp;nbsp;&lt;A href="https://docs.databricks.com/en/structured-streaming/stream-monitoring.html#monitoring-structured-streaming-queries-on-databricks" target="_blank"&gt;Monitoring Structured Streaming queries on Databricks&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Sep 2024 15:43:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89201#M37726</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-09-09T15:43:02Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89258#M37744</link>
      <description>&lt;P&gt;Thanks Raphael! That's helpful. I'll look into the links.&lt;/P&gt;&lt;P&gt;If I could ask you one more question, do you have any references or links to how upscaling and downscaling of the number of workers and cluster size is done for structured streaming? I have a use-case where the amount of data varies drastically at times and I wanted to use the consumer lag to build some scaling logic based on it.&lt;/P&gt;</description>
      <pubDate>Tue, 10 Sep 2024 08:38:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89258#M37744</guid>
      <dc:creator>dmytro</dc:creator>
      <dc:date>2024-09-10T08:38:26Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89310#M37757</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117735"&gt;@dmytro&lt;/a&gt;,&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Autoscaling is managed by Databricks and its logic is mostly automatic. But if you're planning on structured streaming for production, I suggest you go for a fixed amount of workers and limit your streaming query input rate &lt;STRONG&gt;or&lt;/STRONG&gt;&amp;nbsp;create a DLT pipeline that uses enhanced autoscaling.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This doc covers the production considerations for structured streaming workloads:&amp;nbsp;&lt;A href="https://docs.databricks.com/en/structured-streaming/production.html" target="_blank"&gt;https://docs.databricks.com/en/structured-streaming/production.html.&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;As mentioned in the docs above, when working with compute auto-scaling, the auto-scaling algorithm will have some difficulties scaling down for structured streaming workloads:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;SPAN&gt;&lt;STRONG&gt;Compute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads.&lt;/STRONG&gt; Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See&amp;nbsp;&lt;/SPAN&gt;&lt;A class="reference internal" href="https://docs.databricks.com/en/delta-live-tables/auto-scaling.html" target="_blank"&gt;&lt;SPAN class="doc"&gt;Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;BR /&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Compute auto-scaling docs:&amp;nbsp;&lt;A href="https://docs.databricks.com/en/compute/configure.html#benefits-of-autoscaling" target="_self"&gt;https://docs.databricks.com/en/compute/configure.html#benefits-of-autoscaling&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Sep 2024 14:38:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89310#M37757</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-09-10T14:38:16Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89312#M37758</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/47980"&gt;@ranged_coop&lt;/a&gt;&amp;nbsp;In addition to my previous message, c&lt;SPAN&gt;heckpointing&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;is not a Databricks behavior as you said; checkpointing is part of open source Spark Structured Streaming.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Sep 2024 14:42:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/89312#M37758</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-09-10T14:42:28Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/104826#M41895</link>
      <description>&lt;P&gt;hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97998"&gt;@raphaelblg&lt;/a&gt;&amp;nbsp;! a quick question: is it possible to write data from a DLT to a Kafka topic? Is this functionality implemented? I've seen that there is a new create_sink() function, but I cannot find any information about it.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jan 2025 02:40:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/104826#M41895</guid>
      <dc:creator>dmytro</dc:creator>
      <dc:date>2025-01-09T02:40:00Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/105036#M41975</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117735"&gt;@dmytro&lt;/a&gt;&amp;nbsp;yes, but this feature is currently in Private Preview. Please submit a support case in&amp;nbsp;&lt;A href="https://help.databricks.com/s/" target="_blank"&gt;https://help.databricks.com/s/&lt;/A&gt; if you have interest in trying out this new feature.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jan 2025 18:21:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/105036#M41975</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2025-01-09T18:21:12Z</dc:date>
    </item>
    <item>
      <title>Re: can we commit offset in spark structured streaming in databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/105082#M41989</link>
      <description>&lt;P&gt;thanks Raphael, i'll do so.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jan 2025 21:08:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-commit-offset-in-spark-structured-streaming-in-databricks/m-p/105082#M41989</guid>
      <dc:creator>dmytro</dc:creator>
      <dc:date>2025-01-09T21:08:43Z</dc:date>
    </item>
  </channel>
</rss>

