cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

Auto Loader Use Case Question - Centralized Dropzone to Bronze?

ChristianRRL
Contributor II

Good day,

I am trying to use Auto Loader (potentially extending into DLT in the future) to easily pull data coming from an external system (currently located in a single location) and organize it and load it respectively. I am struggling quite a bit at the moment and I would really appreciate some feedback. Please let me know if any of my assumptions or approach towards this is not correct.

Here is an idealized version of what we want and are intending to do:

  1. We are expecting data from multiple different kinds of sites/databases/tables to land in a single "dropzone" in our data lake
  2. As soon as data files land in the dropzone, we would like for the data to be sorted into it's respective landing location (aka: bronze or raw location) organized appropriately by the kind of data it is
  3. Once in the landing or bronze location, we should be able to appropriately use Auto Loader (and/or DLT) easily process our data in accordance to the "Medallion Architecture" model

 

The struggle we have currently is that once the data lands in the "dropzone", I can't see an effective/lean way to programmatically/automatically move the data from the dropzone to it's intended location. I've tried looking into data moving methods, but I am not able to find anything that is as straightforward as I thought this should be.

I thought that Auto Loader should be an easy way to pull data from various databases that is located in a centralized location. However, the more I look into it the more it seems like Auto Loader just "assumes" that the data will be located in the "landing" location (aka: bronze, aka: raw). If this is the case, the value of Auto Loader to me is greatly diminished.

Guidance and feedback on this would be *greatly* appreciated!

 

Below is a brief "sample" of what we're thinking currently:

# dropzone template
    .../<data_source_name>/dropzone/<full_file_name>.csv
---
# dropzone example
    .../source_x/dropzone/Database_A.Schema.Table_A.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_B.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_C.site_number_1.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_A.site_number_2.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_B.site_number_2.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_A.Schema.Table_C.site_number_2.oem_shortname.timestamp.csv
    ...
    .../source_x/dropzone/Database_X.Schema.Table_A.site_number_x.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_X.Schema.Table_B.site_number_x.oem_shortname.timestamp.csv
    .../source_x/dropzone/Database_X.Schema.Table_C.site_number_x.oem_shortname.timestamp.csv
---
# bronze template
.../<data_source_name>/<category>/bronze/<oem_shortname>/<site_number>/<linted_database>/<linted_table_name>/<full_file_name>.csv
---
# bronze example
    .../source_x/electrical/bronze/ge/1/database_a/table_a/<full_file_name>.csv
    .../source_x/electrical/bronze/ge/1/database_a/table_b/<full_file_name>.csv
    .../source_x/electrical/bronze/ge/1/database_a/table_c/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_a/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_b/<full_file_name>.csv
    .../source_x/electrical/bronze/siemens/2/database_b/table_c/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/table_a/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/table_b/<full_file_name>.csv
    .../source_x/mechanical/bronze/ge/3/database_c/Table_C/<full_file_name>.csv

   .../source_x/mechanical/bronze/siemens/4/database_d/table_a/<full_file_name>.csv
    .../source_x/mechanical/bronze/siemens/4/database_d/table_c/<full_file_name>.csv
    .../source_x/mechanical/bronze/siemens/4/database_d/table_d/<full_file_name>.csv

3 REPLIES 3

Kaniz_Fatma
Community Manager
Community Manager

Hi @ChristianRRL,

 

  • Auto Loader is a powerful feature in Databricks for data ingestion.
  • It automatically processes new files in cloud storage.
  • Supported cloud storage services include AWS S3, Azure data lake Storage Gen2, Google Cloud Storage, and more.
  • Auto Loader ensures exactly once processing.
  • Your sample-path structure aligns well with Auto Loader.

Solution:

  1. Configuration:
    • Set up Auto Loader to monitor the dropzone directory.
    • Configure schema inference and evolution if needed.
  2. Testing and Validation:
    • Test the setup with sample data.
    • Verify that files are correctly loaded into the landing location.
  3. Explore DLT:
    • If you plan to extend into DLT, explore how it can enhance your data processing workflows.

Remember that Auto Loader simplifies data ingestion, allowing you to focus on processing and deriving insights from your data. Good luck with your data pipeline! 🌊🚀

 

Source:-

  1.  https://docs.databricks.com/en/ingestion/auto-loader/schema.html
  2. https://docs.databricks.com/en/ingestion/auto-loader/index.html
  3. https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/
     

 

 

I've actually looked through those sources somewhat extensively, but I'm still having some confusion about whether Auto Loader is the right solution. What I was hoping is that data can land in the "dropzone", but afterwards it is *moved* to the correct corresponding path as I showed earlier. For example (for a single file):

# dropzone example
    .../source_x/dropzone/Database_A.Schema.Table_A.site_number_1.oem_shortname.timestamp.csv

# bronze example
    .../source_x/electrical/bronze/ge/1/database_a/table_a/<full_file_name>.csv

However, Auto Loader seems to be more about highlighting one specific path. The issue with that is if I select the dropzone as my bronze-level path, there is a mix of many different raw csv files related to different tables and from different databases at times too. So does Auto Loader specifically support *moving* files from one centralized dbfs location (dropzone in this case) to the respective bronze delta table locations?

ChristianRRL
Contributor II

Quick follow-up on this @Kaniz_Fatma (or to anyone else in the Databricks multi-verse who is able to help clarify this case).

I understand that the proposed solution would work for a "one-to-one" case where many files are landing in a specific dbfs path to be ingested by auto loader into a specific bronze table. OR another similar case where we have "many-to-one" where we are expecting different data from different sources going into a single bronze table, but critically all the different sourced data would have the same format (i.e. all different source data have the same column length/type/order and it's cleaned up in a single bronze raw location).

In my case, what I'm asking about is if auto loader supports a "many-to-many" case (i.e. "many source database_table.csv-to-many target database_table_bronze"), or else what is the suggested Databricks approach to this? In this context, I mean that we have many source locations with many different kinds of tables being loaded into a single "dropzone" location. From this single "dropzone" location, I would expect that either auto loader should natively support my use-case, or if not that there should be a fairly standard approach to dealing with this.

I'm not the only person in the world dealing with this, and I found a similar example of someone else posting this on Stack Overflow: https://stackoverflow.com/questions/69572265/ingest-several-types-of-csvs-with-databricks-auto-loade.... The proposed solution for them was to use "pathGlobFilter", however I believe this only handles the first two cases and not the "many-to-many" case I'm posting here.

For additional context, here's the code I have so far below:

ChristianRRL_1-1702920099476.png

 

 

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!