Good day,
I am trying to use Auto Loader (potentially extending into DLT in the future) to easily pull data coming from an external system (currently located in a single location) and organize it and load it respectively. I am struggling quite a bit at the moment and I would really appreciate some feedback. Please let me know if any of my assumptions or approach towards this is not correct.
Here is an idealized version of what we want and are intending to do:
- We are expecting data from multiple different kinds of sites/databases/tables to land in a single "dropzone" in our data lake
- As soon as data files land in the dropzone, we would like for the data to be sorted into it's respective landing location (aka: bronze or raw location) organized appropriately by the kind of data it is
- Once in the landing or bronze location, we should be able to appropriately use Auto Loader (and/or DLT) easily process our data in accordance to the "Medallion Architecture" model
The struggle we have currently is that once the data lands in the "dropzone", I can't see an effective/lean way to programmatically/automatically move the data from the dropzone to it's intended location. I've tried looking into data moving methods, but I am not able to find anything that is as straightforward as I thought this should be.
I thought that Auto Loader should be an easy way to pull data from various databases that is located in a centralized location. However, the more I look into it the more it seems like Auto Loader just "assumes" that the data will be located in the "landing" location (aka: bronze, aka: raw). If this is the case, the value of Auto Loader to me is greatly diminished.
Guidance and feedback on this would be *greatly* appreciated!
Below is a brief "sample" of what we're thinking currently:
# dropzone template
.../<data_source_name>/dropzone/<full_file_name>.csv
---
# dropzone example
.../source_x/dropzone/Database_A.Schema.Table_A.site_number_1.oem_shortname.timestamp.csv
.../source_x/dropzone/Database_A.Schema.Table_B.site_number_1.oem_shortname.timestamp.csv
.../source_x/dropzone/Database_A.Schema.Table_C.site_number_1.oem_shortname.timestamp.csv
.../source_x/dropzone/Database_A.Schema.Table_A.site_number_2.oem_shortname.timestamp.csv
.../source_x/dropzone/Database_A.Schema.Table_B.site_number_2.oem_shortname.timestamp.csv
.../source_x/dropzone/Database_A.Schema.Table_C.site_number_2.oem_shortname.timestamp.csv
...
.../source_x/dropzone/Database_X.Schema.Table_A.site_number_x.oem_shortname.timestamp.csv
.../source_x/dropzone/Database_X.Schema.Table_B.site_number_x.oem_shortname.timestamp.csv
.../source_x/dropzone/Database_X.Schema.Table_C.site_number_x.oem_shortname.timestamp.csv
---
# bronze template
.../<data_source_name>/<category>/bronze/<oem_shortname>/<site_number>/<linted_database>/<linted_table_name>/<full_file_name>.csv
---
# bronze example
.../source_x/electrical/bronze/ge/1/database_a/table_a/<full_file_name>.csv
.../source_x/electrical/bronze/ge/1/database_a/table_b/<full_file_name>.csv
.../source_x/electrical/bronze/ge/1/database_a/table_c/<full_file_name>.csv
.../source_x/electrical/bronze/siemens/2/database_b/table_a/<full_file_name>.csv
.../source_x/electrical/bronze/siemens/2/database_b/table_b/<full_file_name>.csv
.../source_x/electrical/bronze/siemens/2/database_b/table_c/<full_file_name>.csv
.../source_x/mechanical/bronze/ge/3/database_c/table_a/<full_file_name>.csv
.../source_x/mechanical/bronze/ge/3/database_c/table_b/<full_file_name>.csv
.../source_x/mechanical/bronze/ge/3/database_c/Table_C/<full_file_name>.csv
.../source_x/mechanical/bronze/siemens/4/database_d/table_a/<full_file_name>.csv
.../source_x/mechanical/bronze/siemens/4/database_d/table_c/<full_file_name>.csv
.../source_x/mechanical/bronze/siemens/4/database_d/table_d/<full_file_name>.csv