How to limit number of files in each batch in streaming batch processing

sanjay
Valued Contributor II

Hi,

I am running a batch job that processes incoming files. I am trying to limit the number of files in each batch, so I added the maxFilesPerTrigger option, but it's not working: it processes all incoming files at once.

(spark.readStream.format("delta").load(silver_path)
  .writeStream
  .option("checkpointLocation", gold_checkpoint_path)
  .option("maxFilesPerTrigger", 200)
  .trigger(once=True)
  .foreachBatch(foreachBatchFunction)
  .start()
  .awaitTermination()
)

Please suggest.

Regards,

Sanjay

1 ACCEPTED SOLUTION


Sandeep
Contributor III

@Sanjay Jain, sorry, I missed one thing: .trigger(once=True) doesn't support rate limiters. You can use .trigger(availableNow=True) instead.

ref: https://docs.databricks.com/structured-streaming/triggers.html#configuring-incremental-batch-process...

spark.readStream.format("delta")
  .option("maxFilesPerTrigger", 200)
  .load(silver_path)
  .writeStream
  .option("checkpointLocation", gold_checkpoint_path)
  .trigger(availableNow=True)
  .foreachBatch(foreachBatchFunction)
  .start()
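
For reference, foreachBatchFunction is the user-defined function passed to foreachBatch; it receives each rate-limited micro-batch as a DataFrame together with a batch id. A minimal sketch of what it might look like (the gold table write below is an assumption for illustration, not code from this thread):

def foreachBatchFunction(batch_df, batch_id):
    # With maxFilesPerTrigger set on the source, each call sees data
    # from at most that many source files.
    # "gold_table" is a placeholder name, not from the thread.
    (batch_df.write
        .format("delta")
        .mode("append")
        .saveAsTable("gold_table"))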


20 REPLIES

-werners-
Esteemed Contributor III

Can you try with trigger = availableNow?

sanjay
Valued Contributor II

Tried availableNow, but it's also processing all the data available for processing. I want to process in batches of max 200 files each, even though I have 1,000 files to process.

-werners-
Esteemed Contributor III

OK, how do you know that 1,000 files are selected?

I ask because Delta Lake (your source) also stores old versions of data, which will not be sent to the stream. Physically your Delta Lake might have 1,000 files, but the current state may be only 150 files -> 1 micro-batch.

Is that possible?
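
One way to check this is to compare the physical file count against what Delta reports as the current table state; a quick sketch, assuming the silver_path variable from the question:

# numFiles reflects only the files that make up the current table version,
# not old versions retained for time travel.
detail = spark.sql(f"DESCRIBE DETAIL delta.`{silver_path}`")
detail.select("numFiles").show()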

sanjay
Valued Contributor II

I have sent 1,000 files to the previous layer for processing and I don't want to process them all in one go. I can see all 1,000 received in the current batch.

-werners-
Esteemed Contributor III

I think I found the issue.

The maxFilesPerTrigger option has to be set on the source, not on the sink (as you do).

Try to move the option before the load statement,

so readStream.option().load()...

sanjay
Valued Contributor II

Still getting all 1000 files.

(spark.readStream.format("delta").option("maxFilesPerTrigger", 100).load(silver_path)
  .writeStream
  .option("checkpointLocation", gold_checkpoint_path)
  .trigger(once=True)
  .foreachBatch(foreachBatchFunction)
  .start()
  .awaitTermination()
)

-werners-
Esteemed Contributor III

spark.readStream.format("delta")
  .option("maxFilesPerTrigger", "100")
  .load(<table>)
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "...")
  .table(<table2>)

sanjay
Valued Contributor II

Sorry, I'm not an expert in this. But how do I run my custom code, i.e.

.foreachBatch(foreachBatchFunction)
.start()
.awaitTermination()

-werners-
Esteemed Contributor III

Sorry, it is basically only the part up to load() that is important, so you can keep your foreachBatch sink as it is.

Also try passing the number of files as a string instead of an int.

sanjay
Valued Contributor II

Still getting full load.

df = (spark.readStream.format("delta")
  .option("maxFilesPerTrigger", "100")
  .load(silver_path)
)

(df.writeStream
  .option("checkpointLocation", gold_checkpoint_path)
  .trigger(once=True)
  .foreachBatch(foreachBatchFunction)
  .start()
  .awaitTermination())

-werners-
Esteemed Contributor III

Can you try also setting maxFilesPerTrigger in the sink?

spark.readStream.format("delta")
  .option("maxFilesPerTrigger", "100")
  .load(silver_path)
  .writeStream
  .option("checkpointLocation", gold_checkpoint_path)
  .option("maxFilesPerTrigger", "100")
  .trigger(once=True)
  .foreachBatch(foreachBatchFunction)
  .start()


sanjay
Valued Contributor II

no, still getting all 1000 files

-werners-
Esteemed Contributor III

strange, it should work.
