10-19-2022 06:56 AM
Hi,
I would like to know what you think about using Delta Live Tables when the source for the pipeline is not incremental. By that I mean the data provider creates a new folder of files each time it updates its data (e.g. /data/folder_1, /data/folder_2, /data/folder_3). So each time a new update arrives, I need to process the entire new folder and drop the old data from the previous folder.
I know that DLT and Auto Loader were designed for incremental data, so I ended up running a full refresh each time I ran the pipeline. I have now tried dropping Auto Loader with spark.readStream and using a plain PySpark batch read (spark.read) for ingestion instead, but the "Setting up tables" stage of the pipeline now takes a really long time.
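To make the batch variant concrete, here is a minimal sketch of roughly what I mean, not my exact code. The helper `latest_folder`, the table name `provider_snapshot`, the Parquet file format, and the hard-coded folder list are all placeholders I made up for illustration; in a real pipeline the list would come from listing /data (for example with `dbutils.fs.ls`). The idea is to pick the folder with the highest numeric suffix and define the table as a plain batch read over it, so that each pipeline update recomputes the table from the newest folder only.

```python
import re

try:
    import dlt  # only importable inside a Databricks DLT pipeline
except ImportError:
    dlt = None  # lets the helper below run and be tested outside Databricks


def latest_folder(paths):
    """Return the path with the largest trailing integer suffix.

    Assumes folder names end in _<number>, as in /data/folder_3.
    """
    numbered = []
    for path in paths:
        match = re.search(r"_(\d+)/?$", path)
        if match:
            numbered.append((int(match.group(1)), path))
    if not numbered:
        raise ValueError("no folder with a numeric suffix found")
    # Compare numerically so that folder_10 sorts after folder_2.
    return max(numbered, key=lambda item: item[0])[1]


# Placeholder list; in practice this would be built by listing /data.
folders = ["/data/folder_1", "/data/folder_2", "/data/folder_3"]
source_path = latest_folder(folders)

if dlt is not None:
    # `spark` is the session that Databricks provides in the pipeline notebook.
    @dlt.table(
        name="provider_snapshot",  # hypothetical table name
        comment="Batch read of the newest provider folder only",
    )
    def provider_snapshot():
        # Batch read (not readStream). The file format is an assumption.
        return spark.read.format("parquet").load(source_path)
```

The numeric comparison matters once there are ten or more folders, since a plain string sort would place folder_10 before folder_2.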
What do you think?