02-20-2023 11:55 PM
Hello everyone!
I am trying to read a Delta table as a streaming source with Spark, but my micro-batches are unbalanced: one is very small and the others are huge. How can I limit this?
I tried different configurations of maxBytesPerTrigger and maxFilesPerTrigger, but nothing changes; the batch size is always the same.
Any ideas?
# Read the Delta table as a streaming source
df = spark \
    .readStream \
    .format("delta") \
    .load("...")

# Write the stream out in append mode with a checkpoint location
df \
    .writeStream \
    .outputMode("append") \
    .option("checkpointLocation", "...") \
    .table("...")
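For reference, this is roughly how I tried to apply the options on the reader (the values here are just examples):
df = spark \
    .readStream \
    .format("delta") \
    .option("maxFilesPerTrigger", 10) \
    .option("maxBytesPerTrigger", "1g") \
    .load("...")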
Kind Regards
02-21-2023 04:12 AM
Besides the parameters you mention, I don't know of any others that control the batch size.
Did you check whether the Delta table is badly skewed?
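A quick way to get a rough sense of file-level skew is to count rows per underlying file with a batch read of the same table. This is just a sketch (the path is hypothetical; input_file_name() simply reports which file each row came from):
from pyspark.sql.functions import input_file_name

# Rows per data file: a few huge counts next to many tiny ones indicates skew
spark.read.format("delta").load("/path/to/delta_table") \
    .withColumn("file", input_file_name()) \
    .groupBy("file").count() \
    .orderBy("count", ascending=False) \
    .show(truncate=False)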
02-27-2023 08:52 AM
Thanks, you are right! The data was very skewed.
02-22-2023 02:42 AM
Hi @Yuliya Valava, if you read a Delta table as a stream in PySpark, you can limit the input rate by setting the maxFilesPerTrigger option.
This option controls the maximum number of new files processed in a single trigger interval. By reducing this value, you can limit the input rate and manage the data processed in each batch.
Here's an example of how to limit the input rate when reading a Delta table as a stream:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("myApp").getOrCreate()

# Limit the input rate to 1 file per trigger interval.
# The option must be set on the stream reader, before load().
df = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1)
    .load("/path/to/delta_table")
)

# Process the stream: write to the console and block until termination
query = df.writeStream.format("console").start()
query.awaitTermination()
In this example, we read the Delta table as a stream and set the maxFilesPerTrigger option on the reader to 1, limiting the input rate to one file per trigger interval.
Finally, we write the stream to the console and block on query.awaitTermination().
Note that maxFilesPerTrigger limits the number of files, not their size, so it may not effectively cap the batch size if your Delta table contains a few very large files, or if the files are heavily compressed (the decompressed data can be much larger than the on-disk size).
In such cases, you may need to rewrite the data into smaller files or use a size-based limit instead.
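If a size-based limit fits better, maxBytesPerTrigger can be set on the same reader; it is a soft cap, so a batch may still exceed it when the smallest input unit (a single file) is larger than the limit. A minimal sketch, reusing the hypothetical path from above:
# Soft cap on the amount of data read per micro-batch
df = (
    spark.readStream
    .format("delta")
    .option("maxBytesPerTrigger", "512m")
    .load("/path/to/delta_table")
)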
02-22-2023 02:50 AM
Hi @Yuliya Valava, if you are setting the maxBytesPerTrigger and maxFilesPerTrigger options when reading a Delta table as a stream but the batch size is not changing, there could be a few reasons for this.
To determine whether maxBytesPerTrigger or maxFilesPerTrigger is being applied correctly, you can try setting them to very low values (e.g., 1 or 10) to see if the batch size changes significantly.
You can also monitor the number of files or bytes processed per batch using the Databricks UI or by adding logging statements to your code.
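For example, once the query is running, query.lastProgress and query.recentProgress expose per-batch metrics such as numInputRows; a quick sketch, assuming query is the StreamingQuery handle returned by writeStream...start():
# Metrics for the most recent micro-batch (None until the first batch completes)
progress = query.lastProgress
if progress is not None:
    print(progress["batchId"], progress["numInputRows"])

# Or scan the last few batches to see how balanced they are
for p in query.recentProgress:
    print(p["batchId"], p["numInputRows"])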
If you are still having trouble controlling the batch size, you may want to repartition or compact your data into more evenly sized files before streaming from it, or use other techniques to optimize your processing logic.