Using a DLT pipeline with non-incremental data

140015 · 10-19-2022

Hi,

I would like to know what you think about using Delta Live Tables (DLT) when the source for the pipeline is not incremental. By that I mean the data provider creates a new folder of files each time it updates its data (e.g. /data/folder_1, /data/folder_2, /data/folder_3). So each time a new update arrives, I need to process the entire new folder and drop the old data that came from the previous folder.
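To make the layout concrete, picking the newest folder could look roughly like the sketch below. It is only an illustration: `latest_folder` is a hypothetical helper name, and it assumes the folder names always end in `_<number>` as in the example paths above. In the real pipeline the list of paths would come from listing the storage location (for example with `dbutils.fs.ls`) rather than being hard-coded.

```python
import re


def latest_folder(paths):
    """Return the path with the highest trailing number, or None if none match.

    Assumes names like /data/folder_3 (hypothetical naming taken from the example).
    """
    pattern = re.compile(r"_(\d+)/?$")  # trailing "_<digits>", optional final slash
    numbered = []
    for path in paths:
        match = pattern.search(path)
        if match:
            # Compare as integers so folder_10 sorts after folder_2.
            numbered.append((int(match.group(1)), path))
    if not numbered:
        return None
    return max(numbered)[1]


folders = ["/data/folder_1", "/data/folder_2", "/data/folder_10"]
print(latest_folder(folders))  # /data/folder_10
```

The suffix is compared numerically on purpose: a plain string sort would put folder_10 before folder_2.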

I know that DLT and Auto Loader were both designed for incremental data, so I ended up running a full refresh each time I ran the pipeline. I have now tried dropping Auto Loader with readStream and instead using a plain PySpark batch read (spark.read) to ingest the data into the pipeline, but the "Setting up tables" stage of the pipeline now takes a really long time.

What do you think?