Re: how to process a streaming lakeflow declarativ...

Michał · 09-03-2025

Thanks @szymon_dybczak. From my experiments so far, you can set `maxFilesPerTrigger`, `maxBytesPerTrigger` and other settings in both Python and SQL code when you declare streaming tables in declarative pipelines. However, I don't see any evidence that they are actually taken into account.

The way I read the Structured Streaming Programming Guide (Spark 4.0.0 documentation), in case of a failure (I'm not stopping the runs manually; they stop as a result of a failure) the progress of the micro-batches processed to date should be preserved. Running individual micro-batches was a way to test this and replicate a problem I had in our production system. But your explanation makes sense, and I have also run some local Spark tests to better understand the behaviour. Thanks.
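To make that expectation concrete, here is a toy model in plain Python. It is not Spark and makes no claim about how declarative pipelines are implemented; every name in it (`run_stream`, `checkpoint`, `fail_at_batch`) is hypothetical. It only sketches the idea that each completed micro-batch is committed to the sink and recorded in a checkpoint, so a restart after a failure resumes from the last recorded offset instead of starting over.

```python
# Toy model of checkpointed micro-batch processing (plain Python, not Spark).
# All names here are hypothetical and exist only for illustration.

def run_stream(source, sink, checkpoint, batch_size, fail_at_batch=None):
    """Process `source` in micro-batches.

    Each batch is written to `sink`, then the next offset and batch id are
    recorded in `checkpoint`, so a later call resumes where this one stopped.
    """
    batch_id = checkpoint.get("batch_id", 0)
    offset = checkpoint.get("offset", 0)
    while offset < len(source):
        if fail_at_batch is not None and batch_id == fail_at_batch:
            raise RuntimeError(f"simulated failure in batch {batch_id}")
        batch = source[offset:offset + batch_size]
        # Row-at-a-time transform: no aggregation, no ordering.
        sink.extend(row * 2 for row in batch)
        offset += len(batch)
        batch_id += 1
        # Record progress only after the batch has been written.
        checkpoint["offset"] = offset
        checkpoint["batch_id"] = batch_id


source = list(range(10))
sink, checkpoint = [], {}

# First run: fails when it reaches the third batch (batch id 2).
try:
    run_stream(source, sink, checkpoint, batch_size=3, fail_at_batch=2)
except RuntimeError:
    pass

rows_after_failure = len(sink)  # rows from the two completed batches

# Restart with the same checkpoint: only the remaining rows are processed.
run_stream(source, sink, checkpoint, batch_size=3)
```

In this model the sink already holds the rows from the two completed batches when the failure happens, and the restart only processes what is left. That is the behaviour I expected from my pipelines.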

So perhaps my question should have been about checkpoints. I assumed that declarative pipelines use checkpoints behind the scenes (I don't think we can set them explicitly in declarative pipelines), but the behaviour of my pipelines suggests they are either not set or not working as I would expect. If a pipeline runs for hours and eventually fails, I end up with no data in my sink tables, despite the first task in the pipeline having read hundreds of GB of data by that point, and despite the processing being relatively simple: one row at a time, no aggregation, no ordering, nothing that would require analysing the full source before writing. What am I missing?