Databricks Community

ChristianRRL · ‎12-13-2023

Good day,

I am trying to use Auto Loader (potentially extending into DLT in the future) to easily pull data coming from an external system (currently located in a single location) and organize it and load it respectively. I am struggling quite a bit at the moment and I would really appreciate some feedback. Please let me know if any of my assumptions or approach towards this is not correct.

Here is an idealized version of what we want and are intending to do:

We are expecting data from multiple different kinds of sites/databases/tables to land in a single "dropzone" in our data lake
As soon as data files land in the dropzone, we would like for the data to be sorted into it's respective landing location (aka: bronze or raw location) organized appropriately by the kind of data it is
Once in the landing or bronze location, we should be able to appropriately use Auto Loader (and/or DLT) easily process our data in accordance to the "Medallion Architecture" model

The struggle we have currently is that once the data lands in the "dropzone", I can't see an effective/lean way to programmatically/automatically move the data from the dropzone to it's intended location. I've tried looking into data moving methods, but I am not able to find anything that is as straightforward as I thought this should be.

I thought that Auto Loader should be an easy way to pull data from various databases that is located in a centralized location. However, the more I look into it the more it seems like Auto Loader just "assumes" that the data will be located in the "landing" location (aka: bronze, aka: raw). If this is the case, the value of Auto Loader to me is greatly diminished.

Guidance and feedback on this would be *greatly* appreciated!

Below is a brief "sample" of what we're thinking currently:

# dropzone template
    .../<data_source_name>/dropzone/<full_file_name>.csv
---
# dropzone example
    .../source_x/dropzone/Database_A.Schema.Table_A.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_B.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_C.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_A.site_number_2.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_B.site_number_2.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_C.site_number_2.oem_shortname.timestamp.csv
    ...
    .../source_x/dropzone/Database_X.Schema.Table_A.site_number_x.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_X.Schema.Table_B.site_number_x.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_X.Schema.Table_C.site_number_x.oem_shortname.timestamp.csv
---
# bronze template
.../<data_source_name>/<category>/bronze/<oem_shortname>/<site_number>/<linted_database>/<linted_table_name>/<full_file_name>.csv
---
# bronze example
.../source_x/electrical/bronze/ge/1/database_a/table_a/<full_file_name>.csv
    .../source_x/electrical/bronze/ge/1/database_a/table_b/<full_file_name>.csv
    .../source_x/electrical/bronze/ge/1/database_a/table_c/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_a/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_b/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_c/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/table_a/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/table_b/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/Table_C/<full_file_name>.csv

.../source_x/mechanical/bronze/siemens/4/database_d/table_a/<full_file_name>.csv
    .../source_x/mechanical/bronze/siemens/4/database_d/table_c/<full_file_name>.csv
    .../source_x/mechanical/bronze/siemens/4/database_d/table_d/<full_file_name>.csv

ChristianRRL · ‎12-14-2023

I've actually looked through those sources somewhat extensively, but I'm still having some confusion about whether Auto Loader is the right solution. What I was hoping is that data can land in the "dropzone", but afterwards it is *moved* to the correct corresponding path as I showed earlier. For example (for a single file):

# dropzone example
.../source_x/dropzone/Database_A.Schema.Table_A.site_number_1.oem_shortname.timestamp.csv

# bronze example
.../source_x/electrical/bronze/ge/1/database_a/table_a/<full_file_name>.csv

However, Auto Loader seems to be more about highlighting one specific path. The issue with that is if I select the dropzone as my bronze-level path, there is a mix of many different raw csv files related to different tables and from different databases at times too. So does Auto Loader specifically support *moving* files from one centralized dbfs location (dropzone in this case) to the respective bronze delta table locations?

ChristianRRL · ‎12-18-2023

Quick follow-up on this @Retired_mod (or to anyone else in the Databricks multi-verse who is able to help clarify this case).

I understand that the proposed solution would work for a "one-to-one" case where many files are landing in a specific dbfs path to be ingested by auto loader into a specific bronze table. OR another similar case where we have "many-to-one" where we are expecting different data from different sources going into a single bronze table, but critically all the different sourced data would have the same format (i.e. all different source data have the same column length/type/order and it's cleaned up in a single bronze raw location).

In my case, what I'm asking about is if auto loader supports a "many-to-many" case (i.e. "many source database_table.csv-to-many target database_table_bronze"), or else what is the suggested Databricks approach to this? In this context, I mean that we have many source locations with many different kinds of tables being loaded into a single "dropzone" location. From this single "dropzone" location, I would expect that either auto loader should natively support my use-case, or if not that there should be a fairly standard approach to dealing with this.

I'm not the only person in the world dealing with this, and I found a similar example of someone else posting this on Stack Overflow: https://stackoverflow.com/questions/69572265/ingest-several-types-of-csvs-with-databricks-auto-loade.... The proposed solution for them was to use "pathGlobFilter", however I believe this only handles the first two cases and not the "many-to-many" case I'm posting here.

For additional context, here's the code I have so far below:

Databricks Community

Auto Loader Use Case Question - Centralized Dropzone to Bronze?

Connect with Databricks Users in Your Area

Insights from a global survey of 1,100 technologists and interviews with 28 CIOs

Data + AI Summit: Call for Presentations

Season's Speedings: Databricks SQL Delivers 4x Performance Boost Over Two Years

Now Hiring: Databricks Community Technical Moderator

Become Our Next Monthly Community Champion!