06-06-2024 05:50 AM
We are storing offset details in the checkpoint location. I wanted to know: is there a way to commit the offset back to Kafka once we consume a message?
06-06-2024 12:20 PM
Hi @Nis ,
Spark Structured Streaming tracks which offsets have been consumed internally, rather than relying on the Kafka consumer group mechanism to do it. Spark manages the source offsets itself and writes them to the streaming query's checkpoint.
So the answer is no, you cannot commit a Kafka offset through a Spark Structured Streaming query.
Might be worth checking https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-i....
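To make the checkpoint mechanism concrete: Spark records the offsets for each micro-batch as small JSON-lined files under the checkpoint's `offsets/` directory. Below is a minimal sketch of inspecting one of those files; the file content, topic name, and offset numbers are hypothetical examples, not taken from a real checkpoint.

```python
import json

# Hypothetical content of a file like <checkpoint>/offsets/42:
# line 1 is a version marker, line 2 is batch metadata,
# and the following line(s) hold per-source offsets (topic -> partition -> offset).
sample = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1717653000000}
{"my_topic":{"0":1500,"1":1498}}"""

lines = sample.splitlines()
source_offsets = json.loads(lines[2])  # next offsets Spark will read per partition

for topic, partitions in source_offsets.items():
    for partition, offset in sorted(partitions.items()):
        print(f"{topic} partition {partition}: offset {offset}")
```

This is why the answer above says "no": the source of truth for progress is these checkpoint files, not Kafka's `__consumer_offsets` topic, so restarting the query resumes from the checkpoint regardless of what Kafka thinks the group has consumed.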
09-08-2024 08:33 AM
Sorry for taking this off-topic, but this behaviour of storing the offsets itself instead of depending on Kafka's offsets used to cause checkpoint storage to grow by a lot (I'm talking some 2-3 DBR versions back). Is that still the case now, or is there a setting that needs to be enabled to fix it? Will it cause any issues with the history? (I don't have any data on this now; it's been a long time since I worked on such a use case.)
09-09-2024 08:36 AM - edited 09-09-2024 08:44 AM
@ranged_coop Regarding your questions:
Structured Streaming Programming Guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streamin...
09-10-2024 07:42 AM
@ranged_coop In addition to my previous message: checkpointing is not Databricks-specific behavior as you said; it is part of open source Spark Structured Streaming.
09-07-2024 04:03 AM
Hi @raphaelblg, thanks a lot for the elaborate answer. Do you happen to know, by any chance, of any solutions developers use to track consumer lag when streaming from a Kafka topic with Spark? It's rather essential knowledge for judging whether more Spark workers or resources are needed.
Thanks in advance!
09-09-2024 08:43 AM
@dmytro Yes, it's possible to monitor consumer lag through the streaming query metrics. Every cluster that runs a Spark Structured Streaming query logs the metrics for each streaming batch in the driver logs and the Spark UI. More details at Monitoring Structured Streaming queries on Databricks.
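One way to turn those metrics into a lag number: for a Kafka source, each progress report (e.g. `query.lastProgress`) includes both the batch's end offsets and the latest offsets available on the broker, so per-partition lag can be derived by subtraction. A sketch, assuming a progress payload shaped like the Kafka source's JSON (the topic name and numbers below are made up):

```python
# Hypothetical lastProgress payload for a Kafka source, trimmed to the relevant fields.
progress = {
    "sources": [{
        "description": "KafkaV2[Subscribe[my_topic]]",
        "endOffset": {"my_topic": {"0": 1500, "1": 1498}},
        "latestOffset": {"my_topic": {"0": 1600, "1": 1500}},
    }]
}

def consumer_lag(progress):
    """Per-partition lag: latest available offset minus last processed offset."""
    lag = {}
    for source in progress["sources"]:
        for topic, parts in source.get("latestOffset", {}).items():
            for part, latest in parts.items():
                end = source["endOffset"][topic][part]
                lag[(topic, part)] = latest - end
    return lag

print(consumer_lag(progress))  # {('my_topic', '0'): 100, ('my_topic', '1'): 2}
```

The same fields are available inside a `StreamingQueryListener.onQueryProgress` callback if you want to push the lag to an external monitoring system instead of polling `lastProgress`.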
09-10-2024 01:38 AM
Thanks Raphael! That's helpful. I'll look into the links.
If I could ask one more question: do you have any references or links on how upscaling and downscaling of workers and cluster size is handled for Structured Streaming? I have a use case where the amount of data varies drastically at times, and I wanted to use the consumer lag to build some scaling logic on top of it.
09-10-2024 07:38 AM
@dmytro,
Autoscaling is managed by Databricks and its logic is mostly automatic. But if you're planning to run Structured Streaming in production, I suggest either using a fixed number of workers and limiting your streaming query's input rate, or creating a DLT pipeline that uses enhanced autoscaling.
This doc covers the production considerations for structured streaming workloads: https://docs.databricks.com/en/structured-streaming/production.html.
As mentioned in the docs above, when working with compute auto-scaling, the auto-scaling algorithm will have some difficulties scaling down for structured streaming workloads:
Compute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling.
Compute auto-scaling docs: https://docs.databricks.com/en/compute/configure.html#benefits-of-autoscaling
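For the "fixed workers plus input-rate limit" approach mentioned above, the Kafka source supports capping how much data each micro-batch reads. A minimal PySpark sketch (broker address, topic, and the cap value are placeholders; tune the cap so a fixed-size cluster keeps up with your peak volume):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "my_topic")                    # placeholder topic
    # Cap each micro-batch at ~100k offsets across all partitions, so bursty
    # input is smoothed out instead of overwhelming a fixed-size cluster.
    .option("maxOffsetsPerTrigger", 100000)
    .load()
)
```

With a cap like this, a spike in the topic shows up as growing consumer lag (which you can monitor as discussed earlier) rather than as batches that take longer and longer to finish.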