Recommendations for loading a table from two different folder paths using Autoloader and DLT
08-23-2022 02:19 PM
I have a new (bronze) table that I want to write to. The initial table load (refresh) CSV file is placed in folder a, and the incremental changes (inserts/updates/deletes) CSV files are placed in folder b. I've written a notebook that can load one or the other, but not both.
My intention is to load the table initially from folder a, then consume data changes from folder b as they arrive and use apply_changes to apply them to that same table. So: one target table with two source folders.
What is the recommended approach? What would be a good ingestion pattern for something like this?
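To make the intended behavior concrete, here is a minimal plain-Python sketch (not Spark or DLT code) of the target-table semantics described above: seed the table from the folder a snapshot, then apply the folder b change rows by key in sequence order. The column names `id`, `seq`, and `op`, the operation values, and both helper functions are hypothetical and exist only for illustration. Sorting a single batch is also a simplification of how a real change feed is ordered.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def load_initial(rows: Iterable[Row], key: str) -> Dict[object, Row]:
    """Folder a: the full refresh, assumed to hold one row per key."""
    return {row[key]: dict(row) for row in rows}


def apply_change_rows(
    target: Dict[object, Row],
    changes: Iterable[Row],
    key: str,
    sequence_by: str,
    op_col: str = "op",
) -> Dict[object, Row]:
    """Folder b: apply inserts/updates/deletes in ascending sequence order."""
    for change in sorted(changes, key=lambda r: r[sequence_by]):
        k = change[key]
        if change[op_col] == "delete":
            # Deleting a key that is not present is treated as a no-op.
            target.pop(k, None)
        else:
            # Inserts and updates are both upserts; the op marker is not stored.
            target[k] = {c: v for c, v in change.items() if c != op_col}
    return target


# Hypothetical sample data standing in for the two folders.
initial = [
    {"id": 1, "name": "alpha", "seq": 0},
    {"id": 2, "name": "beta", "seq": 0},
]
changes = [
    {"id": 2, "name": "beta-v2", "seq": 2, "op": "update"},
    {"id": 3, "name": "gamma", "seq": 1, "op": "insert"},
    {"id": 1, "name": "alpha", "seq": 3, "op": "delete"},
]

table = apply_change_rows(load_initial(initial, "id"), changes, "id", "seq")
```

In DLT terms, the key, the ordering column, and the delete condition would correspond to the `keys`, `sequence_by`, and `apply_as_deletes` arguments of `apply_changes`. The sketch assumes the initial-load rows carry (or are given) a sequence value lower than any change row.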
- Labels: Autoloader Approach, DLT
09-07-2022 01:41 PM
Kaniz, thank you for the response. Perhaps this can help; I need to do more reading on ThreadPoolExecutor for Spark. The other "minor" issue I did not mention is that the files in each folder have a few mutually exclusive metadata columns, which I either exclude or synthesize with a "withColumn". The scenario I'm trying to accommodate is the D365 Export to Data Lake, which seems like it should be straightforward but is not really.
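The column mismatch can be illustrated with a small plain-Python sketch that works out, for each folder, which columns to drop and which to synthesize so both sources end up with the same column set. The column names (`meta_a_only`, `meta_b_only`, and so on) and the `plan_alignment` helper are hypothetical; the actual D365 export columns are not shown here.

```python
from typing import Dict, List


def plan_alignment(
    source_columns: List[str], target_columns: List[str]
) -> Dict[str, List[str]]:
    """Return which source columns to drop and which target columns to add."""
    source = set(source_columns)
    target = set(target_columns)
    return {
        # Present in the source but not wanted in the target table.
        "drop": [c for c in source_columns if c not in target],
        # Wanted in the target table but missing from this source.
        "add": [c for c in target_columns if c not in source],
    }


# Hypothetical column lists for the target table and the two folders.
target_cols = ["id", "name", "meta_b_only"]
folder_a_cols = ["id", "name", "meta_a_only"]
folder_b_cols = ["id", "name", "meta_b_only"]

plan_a = plan_alignment(folder_a_cols, target_cols)
plan_b = plan_alignment(folder_b_cols, target_cols)
```

In PySpark, each "drop" entry would map to a `DataFrame.drop` call and each "add" entry to a `withColumn` call with a literal default (for example a null cast to the target type), after which the two sources share one schema.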