cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Auto Loader Use Case Question - Centralized Dropzone to Bronze?

ChristianRRL
Valued Contributor

Good day,

I am trying to use Auto Loader (potentially extending into DLT in the future) to easily pull data coming from an external system (currently located in a single location) and organize it and load it respectively. I am struggling quite a bit at the moment and I would really appreciate some feedback. Please let me know if any of my assumptions or approach towards this is not correct.

Here is an idealized version of what we want and are intending to do:

  1. We are expecting data from multiple different kinds of sites/databases/tables to land in a single "dropzone" in our data lake
  2. As soon as data files land in the dropzone, we would like for the data to be sorted into it's respective landing location (aka: bronze or raw location) organized appropriately by the kind of data it is
  3. Once in the landing or bronze location, we should be able to appropriately use Auto Loader (and/or DLT) easily process our data in accordance to the "Medallion Architecture" model

 

The struggle we have currently is that once the data lands in the "dropzone", I can't see an effective/lean way to programmatically/automatically move the data from the dropzone to it's intended location. I've tried looking into data moving methods, but I am not able to find anything that is as straightforward as I thought this should be.

I thought that Auto Loader should be an easy way to pull data from various databases that is located in a centralized location. However, the more I look into it the more it seems like Auto Loader just "assumes" that the data will be located in the "landing" location (aka: bronze, aka: raw). If this is the case, the value of Auto Loader to me is greatly diminished.

Guidance and feedback on this would be *greatly* appreciated!

 

Below is a brief "sample" of what we're thinking currently:

# dropzone template
    .../<data_source_name>/dropzone/<full_file_name>.csv
---
# dropzone example
    .../source_x/dropzone/Database_A.Schema.Table_A.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_B.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_C.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_A.site_number_2.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_B.site_number_2.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_C.site_number_2.oem_shortname.timestamp.csv
    ...
    .../source_x/dropzone/Database_X.Schema.Table_A.site_number_x.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_X.Schema.Table_B.site_number_x.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_X.Schema.Table_C.site_number_x.oem_shortname.timestamp.csv
---
# bronze template
.../<data_source_name>/<category>/bronze/<oem_shortname>/<site_number>/<linted_database>/<linted_table_name>/<full_file_name>.csv
---
# bronze example
    .../source_x/electrical/bronze/ge/1/database_a/table_a/<full_file_name>.csv
    .../source_x/electrical/bronze/ge/1/database_a/table_b/<full_file_name>.csv
    .../source_x/electrical/bronze/ge/1/database_a/table_c/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_a/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_b/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_c/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/table_a/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/table_b/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/Table_C/<full_file_name>.csv

   .../source_x/mechanical/bronze/siemens/4/database_d/table_a/<full_file_name>.csv
    .../source_x/mechanical/bronze/siemens/4/database_d/table_c/<full_file_name>.csv
    .../source_x/mechanical/bronze/siemens/4/database_d/table_d/<full_file_name>.csv

2 REPLIES 2

I've actually looked through those sources somewhat extensively, but I'm still having some confusion about whether Auto Loader is the right solution. What I was hoping is that data can land in the "dropzone", but afterwards it is *moved* to the correct corresponding path as I showed earlier. For example (for a single file):

# dropzone example
    .../source_x/dropzone/Database_A.Schema.Table_A.site_number_1.oem_shortname.timestamp.csv

# bronze example
    .../source_x/electrical/bronze/ge/1/database_a/table_a/<full_file_name>.csv

However, Auto Loader seems to be more about highlighting one specific path. The issue with that is if I select the dropzone as my bronze-level path, there is a mix of many different raw csv files related to different tables and from different databases at times too. So does Auto Loader specifically support *moving* files from one centralized dbfs location (dropzone in this case) to the respective bronze delta table locations?

ChristianRRL
Valued Contributor

Quick follow-up on this @Retired_mod (or to anyone else in the Databricks multi-verse who is able to help clarify this case).

I understand that the proposed solution would work for a "one-to-one" case where many files are landing in a specific dbfs path to be ingested by auto loader into a specific bronze table. OR another similar case where we have "many-to-one" where we are expecting different data from different sources going into a single bronze table, but critically all the different sourced data would have the same format (i.e. all different source data have the same column length/type/order and it's cleaned up in a single bronze raw location).

In my case, what I'm asking about is if auto loader supports a "many-to-many" case (i.e. "many source database_table.csv-to-many target database_table_bronze"), or else what is the suggested Databricks approach to this? In this context, I mean that we have many source locations with many different kinds of tables being loaded into a single "dropzone" location. From this single "dropzone" location, I would expect that either auto loader should natively support my use-case, or if not that there should be a fairly standard approach to dealing with this.

I'm not the only person in the world dealing with this, and I found a similar example of someone else posting this on Stack Overflow: https://stackoverflow.com/questions/69572265/ingest-several-types-of-csvs-with-databricks-auto-loade.... The proposed solution for them was to use "pathGlobFilter", however I believe this only handles the first two cases and not the "many-to-many" case I'm posting here.

For additional context, here's the code I have so far below:

ChristianRRL_1-1702920099476.png

 

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group