Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Recommendations for loading table from two different folder paths using Autoloader and DLT

bblakey
New Contributor II

I have a new (bronze) table that I want to write to. The initial table load (refresh) CSV file is placed in folder A, and the incremental change (insert/update/delete) CSV files are placed in folder B. I've written a notebook that can load one or the other, but not both.

My intention is to load the table initially from folder A, then consume data changes from folder B as they arrive and apply_changes them to the table loaded from folder A. So: one target table, two source folders.

What is the recommended approach here? What would be a good ingestion pattern for something like this?
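One pattern that seems to fit is a single DLT pipeline that unions two Autoloader streams (folder A for the full load, folder B for CDC) into one view and feeds that to apply_changes. A minimal sketch follows; the folder paths, key column `id`, ordering column `event_ts`, and delete-marker column `op` are all placeholders, not names from the actual D365 export. This only runs inside a Databricks DLT pipeline, where `spark` is provided implicitly.

```python
# Sketch of a DLT pipeline definition (runs only inside a Databricks
# DLT pipeline, where `spark` is available implicitly).
import dlt
from pyspark.sql import functions as F


@dlt.view(name="bronze_raw")
def bronze_raw():
    # Initial full-load CSVs (folder A) -- path is a placeholder.
    initial = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/mnt/landing/folder_a")
    )
    # Incremental CDC CSVs (folder B) -- path is a placeholder.
    incremental = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/mnt/landing/folder_b")
    )
    # allowMissingColumns nulls out each side's absent metadata columns,
    # so the two slightly different schemas can be unioned.
    return initial.unionByName(incremental, allowMissingColumns=True)


# One target streaming table fed by both folders.
dlt.create_streaming_table("bronze")

dlt.apply_changes(
    target="bronze",
    source="bronze_raw",
    keys=["id"],                          # placeholder business key
    sequence_by=F.col("event_ts"),        # placeholder ordering column
    apply_as_deletes=F.expr("op = 'D'"),  # placeholder delete marker
)
```

Because apply_changes upserts by key, the initial-load rows from folder A and the change rows from folder B can flow through the same target table, provided the sequencing column orders them correctly.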

1 REPLY

bblakey
New Contributor II

Kaniz, thank you for the response. Perhaps this can help; I need to do more reading on ThreadPoolExecutor for Spark. The other "minor" issue I did not mention is that the files in each folder have a few mutually exclusive metadata columns that I either exclude/omit or synthesize with a withColumn. The scenario I'm trying to accommodate is the D365 Export to Data Lake, which seems like it should be straightforward but is not.
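The column-reconciliation step mentioned above can be sketched as a small pure-Python helper that, given a source's columns and the combined target schema, reports which columns that source is missing; those are the ones to synthesize with `withColumn(col, lit(None))` before a union. The column names below are made up for illustration, not the real D365 metadata columns.

```python
def missing_columns(source_cols, target_cols):
    """Columns in the combined target schema that this source lacks.

    Each returned column would be synthesized on the source DataFrame
    with withColumn(name, lit(None)) so a unionByName succeeds.
    """
    return [c for c in target_cols if c not in source_cols]


# Hypothetical schemas: folder A carries a full-load timestamp,
# folder B carries CDC metadata.
folder_a = ["id", "name", "full_load_ts"]
folder_b = ["id", "name", "op", "change_ts"]

# Combined schema: folder A's columns plus whatever folder B adds.
combined = folder_a + [c for c in folder_b if c not in folder_a]

print(missing_columns(folder_a, combined))  # ['op', 'change_ts']
print(missing_columns(folder_b, combined))  # ['full_load_ts']
```

Note that `unionByName(..., allowMissingColumns=True)` achieves the same effect automatically; the explicit helper is only useful if you want to control types or default values for the synthesized columns.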
