Delta Live Tables: control microbatch size

skolukmar — Wed, 07 Aug 2024 05:53:51 GMT

A Delta Live Tables pipeline reads a Delta table on Databricks. Is it possible to limit the size of each microbatch during data transformation?

I am thinking of the approach used by Spark Structured Streaming, which lets you control the batch size using:

.option("maxBytesPerTrigger", 104857600)
.option("maxFilesPerTrigger", 100)

Is a similar option available in Delta Live Tables?

Re: Delta Live Tables: control microbatch size

Retired_mod — Thu, 08 Aug 2024 15:37:09 GMT

Hi @skolukmar, yes, you can control the size of microbatches in Delta Live Tables on Databricks with the same options Spark Structured Streaming uses, set on the streaming read of the source Delta table. **`maxBytesPerTrigger`** limits the amount of data processed per microbatch, and **`maxFilesPerTrigger`** limits the number of files considered in each trigger. For example, `.option("maxBytesPerTrigger", 104857600)` sets a limit of 100 MB per microbatch, while `.option("maxFilesPerTrigger", 100)` restricts it to 100 files. Note that the byte limit is a soft maximum: a batch can exceed it when the smallest input unit is larger than the limit. These settings help manage workload and keep batch sizes predictable. Is there anything specific you’re trying to achieve with these settings? Maybe I can help further!
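
As a rough illustration of where these options would go, here is a minimal sketch of a DLT table definition that applies both limits to the streaming read. The helper names (`rate_limit_options`, `with_rate_limits`, `define_limited_table`), the table name `events_limited`, and the source table `source_db.events` are all hypothetical and only there for the example; the `dlt` module and the `spark` session are passed in as arguments because they are only available inside a pipeline.

```python
from typing import Any, Dict, Optional


def rate_limit_options(max_bytes: Optional[int] = None,
                       max_files: Optional[int] = None) -> Dict[str, str]:
    """Build the Delta streaming-source rate-limit options (hypothetical helper)."""
    options: Dict[str, str] = {}
    if max_bytes is not None:
        if max_bytes <= 0:
            raise ValueError("max_bytes must be positive")
        options["maxBytesPerTrigger"] = str(max_bytes)
    if max_files is not None:
        if max_files <= 0:
            raise ValueError("max_files must be positive")
        options["maxFilesPerTrigger"] = str(max_files)
    return options


def with_rate_limits(reader: Any, options: Dict[str, str]) -> Any:
    """Apply each option to a DataStreamReader-like object through .option()."""
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader


# 100 MiB and 100 files, the values from the question.
LIMITS = rate_limit_options(max_bytes=100 * 1024 * 1024, max_files=100)


def define_limited_table(dlt_module: Any, spark_session: Any) -> Any:
    """Register a streaming table whose source read is rate limited.

    Assumption: called from a DLT pipeline notebook as
    define_limited_table(dlt, spark).
    """
    @dlt_module.table(name="events_limited")  # hypothetical table name
    def events_limited():
        reader = with_rate_limits(spark_session.readStream, LIMITS)
        return reader.table("source_db.events")  # hypothetical source table

    return events_limited
```

Inside a pipeline notebook this would be wired up with `import dlt` followed by `define_limited_table(dlt, spark)`. Keeping the options in one dictionary makes it easy to tune the limits per environment without touching the table definition.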

Re: Delta Live Tables: control microbatch size

lprevost — Thu, 08 Aug 2024 17:07:41 GMT

One other thought: if you are considering using the pandas_udf API, there is a way to control batch size there as well. See the pandas_udf guide and note the comments there about the Arrow batch size parameters.
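
For reference, a minimal sketch of what that could look like, assuming the parameter meant here is the Spark setting `spark.sql.execution.arrow.maxRecordsPerBatch`, which caps the number of rows in each Arrow batch handed to a pandas UDF. The helper names and the value 5000 are hypothetical. Setting the value through `spark.conf.set` is also an assumption; in a DLT pipeline it may need to go into the pipeline's Spark configuration instead.

```python
import math
from typing import Any

# Spark setting that caps the rows per Arrow record batch sent to a pandas UDF.
ARROW_BATCH_CONF = "spark.sql.execution.arrow.maxRecordsPerBatch"


def estimated_arrow_batches(partition_rows: int, max_records_per_batch: int) -> int:
    """Estimate how many Arrow batches one partition is split into (hypothetical helper)."""
    if partition_rows < 0 or max_records_per_batch <= 0:
        raise ValueError("rows must be >= 0 and the batch cap must be > 0")
    return math.ceil(partition_rows / max_records_per_batch)


def build_scaling_udf(spark_session: Any, max_records_per_batch: int = 5000) -> Any:
    """Set the Arrow batch cap and return a simple Series-to-Series pandas UDF."""
    # Imported lazily so this module loads even where pyspark is not installed.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    spark_session.conf.set(ARROW_BATCH_CONF, str(max_records_per_batch))

    @pandas_udf("double")
    def times_two(values: pd.Series) -> pd.Series:
        # Each call receives at most max_records_per_batch rows.
        return values * 2.0

    return times_two
```

With a cap of 5000, `estimated_arrow_batches(12000, 5000)` gives 3, so a partition of 12,000 rows would reach the UDF in about three calls. This limits the size of the pandas batches inside the UDF, not the size of the streaming microbatch itself.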