11-05-2024 11:01 AM
Description:
As a data engineer, I need to implement an automated process to ingest data from multiple files in a subdirectory and create corresponding bronze tables. This process should handle full file refreshes and consider strategies to limit the growth of the bronze streaming tables.
Acceptance Criteria:
Identify Files: The process can identify all files in a specified subdirectory, handling CSV files.
Build Table Definitions: The process can automatically generate table creation SQL statements based on the schema inferred from the input files.
Implement Bronze Table Ingestion: The process can ingest data from each file into a corresponding bronze table, handling full file refreshes (streaming tables, then materialized views).
Optimize Bronze Table Growth: The process includes a strategy to limit the growth of the bronze tables, such as materialized views, a truncate/merge pattern on the bronze table, or partitioning.
Provide Reusable and Maintainable Code: The ingestion process is implemented as a reusable Python script.
Can someone help with this?
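For illustration, a minimal sketch of such a script might look like the following. The catalog, schema, volume path, and table-naming convention are hypothetical placeholders, and schema inference is used in place of generated CREATE TABLE statements:

```python
# Sketch: full-refresh each CSV file in a volume subdirectory into a bronze
# table named after the file. Paths and schema names below are assumptions.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

SOURCE_DIR = "/Volumes/main/raw/landing"   # hypothetical volume path
TARGET_SCHEMA = "main.bronze"              # hypothetical target schema

def ingest_csv_files(source_dir: str, target_schema: str) -> None:
    """Load each CSV file into a bronze table, replacing prior contents."""
    for name in os.listdir(source_dir):
        if not name.lower().endswith(".csv"):
            continue
        table_name = f"{target_schema}.{os.path.splitext(name)[0]}"
        df = (
            spark.read
            .option("header", "true")
            .option("inferSchema", "true")  # schema inferred from the file
            .csv(f"{source_dir}/{name}")
        )
        # mode("overwrite") replaces the table contents, so re-delivering a
        # file acts as a full refresh rather than an append
        (df.write
           .mode("overwrite")
           .option("overwriteSchema", "true")
           .saveAsTable(table_name))

ingest_csv_files(SOURCE_DIR, TARGET_SCHEMA)
```

The batch overwrite here also bounds table growth directly, since each run replaces the previous load instead of accumulating history.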
11-06-2024 12:35 PM
What do you mean by "full file refreshes"? Does this refer to the fact that file names will be reused?
11-07-2024 06:11 AM
We are trying an ingestion process from a volume to a bronze table. If there are multiple CSV files in that volume, it needs to loop through each file in full-refresh mode (truncate the old data and load the new file). When we use the Auto Loader option, it currently appends the data each time it reads a new file instead of replacing the old data.
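For reference, the kind of Auto Loader stream being described looks roughly like this; with default settings it only picks up file names it has not seen before and appends their rows to the target table (all paths and table names are hypothetical placeholders):

```python
# Sketch of a default Auto Loader ingestion stream (appends new files only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")  # hypothetical
    .load("/Volumes/main/raw/landing")                                         # hypothetical
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/events")     # hypothetical
    .trigger(availableNow=True)
    .toTable("main.bronze.events")  # appends; re-written files are ignored by default
)
```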
11-07-2024 06:29 AM
Auto Loader keeps an inventory of the files it has already loaded and will not re-load them; this is the default behavior. You can override this behavior by enabling "cloudFiles.allowOverwrites", and Auto Loader will then re-ingest files based on their modification time. Keep in mind that Auto Loader cannot detect records that were already loaded, so it will load the same data twice if you enable "cloudFiles.allowOverwrites". Our recommended best practice, when loading files into cloud storage, is to create new files with new names.
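A sketch of that override, using the same hypothetical paths as above:

```python
# Sketch: Auto Loader with cloudFiles.allowOverwrites enabled, so files are
# re-ingested when their modification time changes. Paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.allowOverwrites", "true")  # re-read modified files
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")
    .load("/Volumes/main/raw/landing")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/events")
    .trigger(availableNow=True)
    .toTable("main.bronze.events")  # caution: reloaded files append duplicate rows
)
```

Because reloaded files are appended, some downstream deduplication (for example, a materialized view over the bronze table) would still be needed to present only the latest copy of each file.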
11-07-2024 06:59 AM
Thank you, I will try that; this is how the files are named, with a directory for each source.