11-05-2024 11:01 AM
Description:
As a data engineer, I need to implement an automated process to ingest data from multiple files in a subdirectory and create corresponding bronze tables. This process should handle full file refreshes and consider strategies to limit the growth of the bronze streaming tables.
Acceptance Criteria:
Identify Files: The process can identify all files in a specified subdirectory, handling CSV files.
Build Table Definitions: The process can automatically generate table creation SQL statements based on the schema inferred from the input files.
Implement Bronze Table Ingestion: The process can ingest data from each file into a corresponding bronze table, handling full file refreshes (streaming tables, then materialized views).
Optimize Bronze Table Growth: The process includes a strategy to limit the growth of the bronze tables, such as materialized views, a truncate/merge pattern on the bronze table, or partitioning.
Provide Reusable and Maintainable Code: The ingestion process is implemented as a reusable Python script.
Can someone help with this?
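For illustration, a minimal sketch of such a script might look like the following. The catalog, schema, volume path, and table-naming convention are hypothetical placeholders, and schema inference is used in place of generated CREATE TABLE statements:

```python
# Sketch: full-refresh each CSV file in a volume subdirectory into a bronze
# table named after the file. Paths and schema names below are assumptions.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

SOURCE_DIR = "/Volumes/main/raw/landing"   # hypothetical volume path
TARGET_SCHEMA = "main.bronze"              # hypothetical target schema

def ingest_csv_files(source_dir: str, target_schema: str) -> None:
    """Load each CSV file into a bronze table, replacing prior contents."""
    for name in os.listdir(source_dir):
        if not name.lower().endswith(".csv"):
            continue
        table_name = f"{target_schema}.{os.path.splitext(name)[0]}"
        df = (
            spark.read
            .option("header", "true")
            .option("inferSchema", "true")  # schema inferred from the file
            .csv(f"{source_dir}/{name}")
        )
        # mode("overwrite") replaces the table contents, so re-delivering a
        # file acts as a full refresh rather than an append
        (df.write
           .mode("overwrite")
           .option("overwriteSchema", "true")
           .saveAsTable(table_name))

ingest_csv_files(SOURCE_DIR, TARGET_SCHEMA)
```

The batch overwrite here also bounds table growth directly, since each run replaces the previous load instead of accumulating history.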
11-06-2024 12:35 PM
What do you mean by "full file refreshes"? Does this refer to the fact that file names will be reused?
11-07-2024 06:11 AM
We are trying an ingestion process from a volume to a bronze table. If there are multiple CSV files in that volume, it needs to loop through each file in full-refresh mode (truncate the old data and load the new file). When we use the Auto Loader option, it currently appends the data each time it reads a new file instead of replacing the old data.
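For reference, the kind of Auto Loader stream being described looks roughly like this; with default settings it only picks up file names it has not seen before and appends their rows to the target table (all paths and table names are hypothetical placeholders):

```python
# Sketch of a default Auto Loader ingestion stream (appends new files only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")  # hypothetical
    .load("/Volumes/main/raw/landing")                                         # hypothetical
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/events")     # hypothetical
    .trigger(availableNow=True)
    .toTable("main.bronze.events")  # appends; re-written files are ignored by default
)
```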
11-07-2024 06:29 AM
Auto Loader keeps an inventory of the files it has already loaded and will not re-load them; this is the default behavior. You can override this behavior by enabling "cloudFiles.allowOverwrites", and Auto Loader will then re-ingest files based on their modification time. Keep in mind that Auto Loader cannot detect records that were already loaded, so it will load the same data twice if you enable "cloudFiles.allowOverwrites". Our recommended best practice, when loading files into cloud storage, is to create new files with new names.
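A sketch of that override, using the same hypothetical paths as above:

```python
# Sketch: Auto Loader with cloudFiles.allowOverwrites enabled, so files are
# re-ingested when their modification time changes. Paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.allowOverwrites", "true")  # re-read modified files
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")
    .load("/Volumes/main/raw/landing")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/events")
    .trigger(availableNow=True)
    .toTable("main.bronze.events")  # caution: reloaded files append duplicate rows
)
```

Because reloaded files are appended, some downstream deduplication (for example, a materialized view over the bronze table) would still be needed to present only the latest copy of each file.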
11-07-2024 06:59 AM
Thank you, I will try that; this is how the files are named, with a directory for each source.