Delta Live Tables: control microbatch size
08-06-2024 10:53 PM
A Delta Live Tables pipeline reads a Delta table on Databricks. Is it possible to limit the size of each microbatch during data transformation?
I am thinking of something like the options Spark Structured Streaming provides for controlling batch size:
.option("maxBytesPerTrigger", 104857600)
.option("maxFilesPerTrigger", 100)
Is any similar option applicable?
- Labels: Workflows
08-08-2024 08:37 AM
Hi @skolukmar, yes, you can control the size of microbatches in Delta Live Tables on Databricks using the same options as Spark Structured Streaming, set on the streaming read of the source Delta table. Use **`maxBytesPerTrigger`** to limit the amount of data processed per microbatch, and **`maxFilesPerTrigger`** to limit the number of new files considered in each trigger. For example, `.option("maxBytesPerTrigger", 104857600)` sets a 100 MB limit per microbatch, while `.option("maxFilesPerTrigger", 100)` restricts it to 100 files. Note that for Delta sources `maxBytesPerTrigger` is a soft limit: a batch processes approximately that much data and can exceed it when the smallest input unit is larger than the limit. If both options are set, the microbatch stops at whichever limit is reached first. These settings help keep the workload of each microbatch bounded and predictable. Is there anything specific you’re trying to achieve with these settings? Maybe I can help further!
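To make the reply above concrete, here is a minimal sketch of how the two options might be chained onto a streaming reader. The helper name `with_batch_limits` and the table name in the commented usage are hypothetical, introduced only for illustration; the option names and values are the ones quoted in the thread. The DLT part is left as a comment because `dlt` and `spark` are only available inside a pipeline.

```python
def with_batch_limits(reader, max_bytes=None, max_files=None):
    """Chain rate-limit options onto a streaming reader.

    `reader` is expected to behave like spark.readStream, i.e. expose
    .option(key, value) and return a reader. This helper is an
    illustrative assumption, not part of any Databricks API.
    """
    if max_bytes is not None:
        reader = reader.option("maxBytesPerTrigger", max_bytes)
    if max_files is not None:
        reader = reader.option("maxFilesPerTrigger", max_files)
    return reader


# Possible usage inside a DLT pipeline notebook, where `dlt` and `spark`
# are provided by the runtime (the table name below is hypothetical):
#
# import dlt
#
# @dlt.table
# def limited_events():
#     reader = with_batch_limits(
#         spark.readStream,
#         max_bytes=104857600,  # about 100 MB per microbatch
#         max_files=100,        # at most 100 new files per microbatch
#     )
#     return reader.table("my_catalog.my_schema.source_events")
```

The options are applied on the read side because they throttle how much of the source Delta table each microbatch picks up; whether they suit a given pipeline depends on file sizes and cluster capacity.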
08-08-2024 10:07 AM
One other thought -- if you are considering the pandas_udf API, there is a way to control batch size there as well. See the pandas_udf guide, and note its comments about the Arrow batch size parameters.
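For the pandas_udf route, the Arrow batch size is governed by the Spark configuration `spark.sql.execution.arrow.maxRecordsPerBatch`, which caps the number of rows in each Arrow batch handed to the UDF. This is separate from the streaming microbatch size discussed earlier in the thread. A small sketch follows; the helper name `set_arrow_batch_size` and the value 5000 are assumptions for illustration only.

```python
ARROW_BATCH_CONF = "spark.sql.execution.arrow.maxRecordsPerBatch"


def set_arrow_batch_size(spark, max_records):
    """Cap the rows per Arrow batch passed to pandas UDFs.

    `spark` is expected to behave like a SparkSession, i.e. expose
    spark.conf.set(key, value). The helper itself is hypothetical.
    """
    max_records = int(max_records)
    if max_records <= 0:
        raise ValueError("max_records must be a positive integer")
    spark.conf.set(ARROW_BATCH_CONF, str(max_records))
    return spark


# Possible usage in a notebook where `spark` already exists:
# set_arrow_batch_size(spark, 5000)
```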