Databricks Academy Learners

Help needed with an automated file ingestion process in full refresh mode, preferably using Auto Loader

DB_Learner_3
New Contributor II

Description:
As a data engineer, I need to implement an automated process to ingest data from multiple files in a subdirectory and create corresponding bronze tables. This process should handle full file refreshes and consider strategies to limit the growth of the bronze streaming tables.

Acceptance Criteria:

  1. Identify Files: The process can identify all files in a specified subdirectory, handling CSV and other supported formats.

  2. Build Table Definitions: The process can automatically generate table creation SQL statements based on the schema inferred from the input files.

  3. Implement Bronze Table Ingestion: The process can ingest data from each file into a corresponding bronze table, handling full file refreshes (streaming tables, then materialized views).

  4. Optimize Bronze Table Growth: The process includes a strategy to limit the growth of the bronze tables, such as materialized views, bronze table with truncate/merge, or partitioning.

  5. Provide Reusable and Maintainable Code: The ingestion process is implemented as a reusable Python script.

Can someone help with this?
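A minimal sketch of criteria 1–3 might look like the following. The volume paths, catalog names, and helper names (`bronze_table_name`, `ingest_source`) are my own placeholders, not anything from the original post; the `ingest_source` part assumes a Databricks runtime with Auto Loader available.

```python
# Sketch: map each source subdirectory in a volume to a bronze table and
# start an Auto Loader stream per source. Paths/names are hypothetical.
import re
from pathlib import PurePosixPath


def bronze_table_name(subdir: str, prefix: str = "bronze") -> str:
    """Derive a bronze table name from a source subdirectory path."""
    stem = PurePosixPath(subdir).name.lower()
    stem = re.sub(r"[^a-z0-9]+", "_", stem).strip("_")
    return f"{prefix}_{stem}"


def ingest_source(spark, source_dir: str, table: str, checkpoint_root: str):
    """Start one Auto Loader stream from source_dir into a bronze table."""
    (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")  # schema inferred from the files
        .option("cloudFiles.schemaLocation", f"{checkpoint_root}/{table}/schema")
        .option("header", "true")
        .load(source_dir)
        .writeStream
        .option("checkpointLocation", f"{checkpoint_root}/{table}/checkpoint")
        .trigger(availableNow=True)  # batch-style incremental run
        .toTable(table))
```

The naming helper is pure Python and reusable outside Databricks; downstream, a materialized view over each bronze table (criterion 4) can expose only the latest load while the streaming table remains append-only.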

4 Replies

BigRoux
Databricks Employee

What do you mean by "full file refreshes"?  Does this refer to the fact that file names will be reused?

We are trying to ingest data from a volume into bronze tables. If multiple CSV files are in that volume, the process needs to loop through each file in full refresh mode (truncate the old data and load the new file). With the Auto Loader option, it currently appends the data each time it reads a new file instead of replacing the old data with the new.

BigRoux
Databricks Employee

Auto Loader keeps an inventory of the files it has already loaded and will not re-load them; this is the default behavior. You can override this by enabling "cloudFiles.allowOverwrites", and Auto Loader will then re-ingest files based on their file modification time. Keep in mind that Auto Loader cannot detect records that were already loaded: with "cloudFiles.allowOverwrites" enabled, it will load the same data twice. Our recommended best practice, when loading files into cloud storage, is to create new files with new names.
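As a rough illustration of the option described above: the helper below builds the Auto Loader reader options with `cloudFiles.allowOverwrites` enabled. The helper name and all paths are my own placeholders; the commented usage assumes a Databricks runtime.

```python
# Sketch: Auto Loader options that re-ingest files overwritten in place
# (detected via modification time). Helper name and paths are hypothetical.
def autoloader_options(fmt: str, schema_location: str) -> dict:
    """Reader options for an Auto Loader stream that re-reads overwritten files."""
    return {
        "cloudFiles.format": fmt,
        "cloudFiles.schemaLocation": schema_location,
        # Re-process a file when its modification time changes. Caution:
        # Auto Loader is append-only, so re-ingesting an overwritten file
        # loads its records a second time (no built-in dedup).
        "cloudFiles.allowOverwrites": "true",
    }


# On Databricks the options would be applied like this (assumed paths):
# df = (spark.readStream.format("cloudFiles")
#         .options(**autoloader_options("csv", "/Volumes/main/raw/_schema"))
#         .load("/Volumes/main/raw/source_a"))
```

To get "latest file wins" semantics downstream, one common pattern is to keep the bronze streaming table append-only and expose only the most recent load through a materialized view filtered on an ingestion timestamp.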

 

DB_Learner_3
New Contributor II

Thank you, I will try that. This is how the files are named, with a directory for each source:

[Attached screenshot: DB_Learner_3_0-1730991495796.png]

 
