Recommendations for loading table from two different folder paths using Autoloader and DLT

bblakey
New Contributor II

I have a new (bronze) table that I want to write to. The initial table load (refresh) CSV file is placed in folder a, and the incremental-change (insert/update/delete) CSV files are placed in folder b. I've written a notebook that can load one OR the other, but not both.

My intention is to load the table initially (from folder a), then consume data changes (from folder b) as they arrive and apply_changes them to the table loaded from folder a. So: one target table with two source folders.

What is the recommended approach here? What would be a good ingestion pattern for something like this?
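One common pattern for the situation described above is to define each folder as its own Auto Loader source in a DLT view, union them into a single change feed, and let apply_changes maintain the target table. A minimal sketch follows; the folder paths, the `id` key, and the `operation`/`sequence_num` columns are all assumptions to adjust to your actual data, and this only runs inside a Delta Live Tables pipeline:

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical landing paths; adjust to your storage layout.
FULL_PATH = "/mnt/landing/folder_a"   # initial/refresh load
INCR_PATH = "/mnt/landing/folder_b"   # incremental changes

@dlt.view
def full_load():
    # Treat every row of the initial load as an insert at sequence 0.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .load(FULL_PATH)
            .withColumn("operation", F.lit("insert"))
            .withColumn("sequence_num", F.lit(0)))

@dlt.view
def incremental_load():
    # Assumes the change files already carry operation/sequence columns.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .load(INCR_PATH))

@dlt.view
def combined_feed():
    # Both sides must expose the same columns for the union to succeed.
    return dlt.read_stream("full_load").unionByName(
        dlt.read_stream("incremental_load"))

dlt.create_streaming_table("bronze_target")

dlt.apply_changes(
    target="bronze_target",
    source="combined_feed",
    keys=["id"],                  # hypothetical primary key
    sequence_by="sequence_num",   # hypothetical ordering column
    apply_as_deletes=F.expr("operation = 'delete'"),
    except_column_list=["operation", "sequence_num"],
)
```

Because the initial load is stamped with sequence 0, any change arriving later with a higher sequence number wins, which is what lets one apply_changes target consume both folders.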

3 REPLIES

Kaniz
Community Manager

Hi @Bill Blakey, does this Stack Overflow thread help you find your solution?

bblakey
New Contributor II

Kaniz, thank you for the response. Perhaps this can help; I need to do more reading on ThreadPoolExecutor with Spark. The other "minor" issue I did not mention is that the files in each folder have a few mutually exclusive metadata columns, which I either exclude/omit or synthesize with a withColumn call. The scenario I'm trying to accommodate is the D365 Export to Data Lake, which seems like it should be straightforward but really isn't.
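For the mutually exclusive metadata columns, the usual trick before a union is to add each side's missing columns as nulls (in Spark, `df.withColumn(col, F.lit(None))` per missing column, or `unionByName(..., allowMissingColumns=True)`). The idea in plain Python, with made-up column names:

```python
# Plain-Python illustration of aligning two record shapes before a union;
# in Spark each missing column would be synthesized with withColumn(F.lit(None)).
def align(rows, all_cols):
    """Return rows with every column present, filling absent ones with None."""
    return [{c: r.get(c) for c in all_cols} for r in rows]

full_rows = [{"id": 1, "name": "a"}]                       # folder a shape
incr_rows = [{"id": 2, "name": "b", "cdc_op": "insert"}]   # folder b shape

# Union of all column names seen on either side.
all_cols = sorted({c for r in full_rows + incr_rows for c in r})
unioned = align(full_rows, all_cols) + align(incr_rows, all_cols)
# unioned[0] == {"cdc_op": None, "id": 1, "name": "a"}
```

Once both sides share the same columns, the combined stream can feed a single target.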

Kaniz
Community Manager

Hi @Bill Blakey, thank you for your response.
