I have a large stream of data read from Confluent Kafka, 500+ million rows. When I initialize the stream I can't control the size of the batches that are read.
So far I've tried:
- Setting options on the readStream: maxBytesPerTrigger, maxOffsetsPerTrigger, fetch.max.bytes, max.poll.records (see the sketch below)
- Configuring the Spark cluster option maxRatePerPartition
- Starting with a fresh checkpoint
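Here is a minimal sketch of how I'm wiring up the stream and where I'm setting those options. The broker, topic, checkpoint path, and the numeric limits are placeholders, not my real values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-batch-size-test").getOrCreate()

stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "my_topic")                     # placeholder topic
    .option("startingOffsets", "earliest")
    # Options I tried in order to cap the micro-batch size:
    .option("maxOffsetsPerTrigger", 100000)
    .option("kafka.fetch.max.bytes", 52428800)
    .option("kafka.max.poll.records", 10000)
    .load()
)

query = (
    stream_df.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoint")     # fresh checkpoint each run
    .start()
)
```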