Databricks Community

AdamRink · ‎11-28-2022

I have a large stream of data read from Confluent Kafka, 500+ million rows. When I initialize the stream, I cannot control the batch sizes that are read.

I've tried setting options on the readStream: maxBytesPerTrigger, maxOffsetsPerTrigger, fetch.max.bytes, max.poll.records

Configuring the Spark cluster option maxRatePerPartition

Starting with a fresh checkpoint

UmaMahesh1 · ‎11-29-2022

Hi @Adam Rink

Just checking for further info on your question. How are you determining that the batch sizes are larger than the value you set for maxOffsetsPerTrigger?

Uma Mahesh D
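For reference, here is a minimal sketch of how maxOffsetsPerTrigger is typically passed to Spark's Kafka source in PySpark. The helper names (`kafka_source_options`, `read_kafka_stream`), the broker address, and the topic name are hypothetical and not taken from the original post. The authentication settings that Confluent Cloud normally requires are left out.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str,
                         max_offsets_per_trigger: int) -> Dict[str, str]:
    """Build options for Spark's Kafka source with a cap on offsets per micro-batch.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    if max_offsets_per_trigger <= 0:
        raise ValueError("max_offsets_per_trigger must be positive")
    return {
        # Kafka client settings are passed through with the "kafka." prefix.
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Only used when the query starts without an existing checkpoint.
        "startingOffsets": "earliest",
        # Total offsets per trigger, split proportionally across topic partitions.
        "maxOffsetsPerTrigger": str(max_offsets_per_trigger),
    }


def read_kafka_stream(spark, options: Dict[str, str]):
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    return spark.readStream.format("kafka").options(**options).load()
```

Usage would look like `read_kafka_stream(spark, kafka_source_options("broker:9092", "events", 100000))`. Keeping the options in a plain dict makes it easy to print and confirm exactly what the stream was started with.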

AdamRink · ‎11-29-2022

Looking at the SQL job, I see 309 million rows and 55 hours of run time while the stream status is still initializing. No data has been written to the table, which is the final step of the process, either.
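One way to check per-batch sizes directly, rather than inferring them from the SQL job view, is to read the streaming query's progress events. The sketch below assumes the progress entries are dict-like with `batchId` and `numInputRows` fields, as returned by `StreamingQuery.recentProgress` in PySpark 3.x. The function names are hypothetical.

```python
from typing import Dict, Iterable, List


def rows_per_batch(progress_events: Iterable[Dict]) -> List[int]:
    """Extract the input row count of each reported micro-batch."""
    return [int(p.get("numInputRows", 0)) for p in progress_events]


def batches_over_limit(progress_events: Iterable[Dict], limit: int) -> List[int]:
    """Return the batch IDs whose input row count exceeded `limit`."""
    return [
        p["batchId"]
        for p in progress_events
        if int(p.get("numInputRows", 0)) > limit
    ]
```

With a running query handle `query`, this could be called as `rows_per_batch(query.recentProgress)`. An empty result may mean that no micro-batch has completed yet.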


How to limit batch size from Confluent Kafka
