Re: Recommendations for loading table from two dif...

bblakey · 08-23-2022

I have a new (bronze) table that I want to write to. The initial table load (refresh) CSV file is placed in folder a, and the incremental changes (inserts/updates/deletes) arrive as CSV files in folder b. I've written a notebook that can load one or the other, but not both.

My intention is to load the table initially from folder a, then consume data changes from folder b as they arrive and apply them with apply_changes to the table I loaded from folder a. So: one target table with two source folders.

What is the recommended approach here? What would be a good ingestion pattern for something like this?

bblakey · 09-07-2022

Kaniz, thank you for the response. Perhaps this can help; I need to do more reading on ThreadPoolExecutor for Spark. The other "minor" issue I did not mention is that the files in each folder have a few mutually exclusive metadata columns, which I either omit or synthesize with a "withColumn". The scenario I'm trying to accommodate is the D365 Export to Data Lake, which seems like it should be straightforward but is not really.
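
For anyone landing here with the same setup, one pattern that may fit is to read both folders with Auto Loader, union the two streams into a single DLT view, and point apply_changes at that view. The sketch below is only an illustration under assumptions, not a tested D365 solution: the folder paths, the key column `Id`, and the `_op` / `_seq` column names are all hypothetical, and it assumes the change files carry an operation column and a numeric sequence column, that the files have a header row, and that change sequence numbers are greater than 0. The initial-load rows get those two columns synthesized with withColumn, and `unionByName(..., allowMissingColumns=True)` fills any metadata columns that exist on only one side with nulls.

```python
# Hypothetical names; none of these come from the thread.
INITIAL_PATH = "/mnt/landing/folder_a"   # initial (refresh) CSV files
CHANGES_PATH = "/mnt/landing/folder_b"   # incremental change CSV files
KEYS = ["Id"]                            # assumed primary key column
OP_COL = "_op"                           # assumed operation column in change files
SEQ_COL = "_seq"                         # assumed ordering column in change files


def delete_condition(op_col: str, delete_value: str) -> str:
    """Build the SQL expression that marks a change row as a delete."""
    return f"{op_col} = '{delete_value}'"


def define_pipeline(spark) -> None:
    """Register the view and target table. Only works inside a DLT pipeline."""
    import dlt  # provided by the Databricks DLT runtime, not installable locally
    from pyspark.sql import functions as F

    def read_csv(path):
        # Auto Loader stream over a folder; header=true is an assumption.
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "csv")
                .option("header", "true")
                .load(path))

    @dlt.view(name="bronze_feed")
    def bronze_feed():
        # Initial load: synthesize the columns that only the change files have.
        initial = (read_csv(INITIAL_PATH)
                   .withColumn(OP_COL, F.lit("INSERT"))
                   .withColumn(SEQ_COL, F.lit(0).cast("long")))
        # Changes: CSV columns are read as strings, so cast the sequence column.
        changes = (read_csv(CHANGES_PATH)
                   .withColumn(SEQ_COL, F.col(SEQ_COL).cast("long")))
        # Columns present on only one side are filled with nulls.
        return initial.unionByName(changes, allowMissingColumns=True)

    # Older DLT releases named this create_streaming_live_table.
    dlt.create_streaming_table("bronze_table")

    dlt.apply_changes(
        target="bronze_table",
        source="bronze_feed",
        keys=KEYS,
        sequence_by=F.col(SEQ_COL),
        apply_as_deletes=F.expr(delete_condition(OP_COL, "DELETE")),
        except_column_list=[OP_COL, SEQ_COL],
    )
```

In a DLT notebook this would be invoked once at the top level as `define_pipeline(spark)`. Giving the initial rows the lowest sequence value is what lets later changes from folder b win over the rows loaded from folder a; if the real change feed orders by a timestamp instead, the synthesized value would need to be an equally early timestamp.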

Recommendations for loading table from two different folder paths using Autoloader and DLT