Databricks Community

DB_Learner_3 · yesterday

Description:
As a data engineer, I need to implement an automated process to ingest data from multiple files in a subdirectory and create corresponding bronze tables. This process should handle full file refreshes and consider strategies to limit the growth of the bronze streaming tables.

Acceptance Criteria:

Identify Files: The process can identify all files in a specified subdirectory, handling both CSV
Build Table Definitions: The process can automatically generate table creation SQL statements based on the schema inferred from the input files.
Implement Bronze Table Ingestion: The process can ingest data from each file into a corresponding bronze table, handling full file refreshes (streaming tables, then materialized views).
Optimize Bronze Table Growth: The process includes a strategy to limit the growth of the bronze tables, such as materialized views, bronze table with truncate/merge, or partitioning.
Provide Reusable and Maintainable Code: The ingestion process is implemented as a reusable Python script.

Can someone help with this?
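For the first two criteria (identifying files and building table definitions), here is a minimal plain-Python sketch, not a definitive implementation. It assumes CSV files with a header row sitting directly in the subdirectory. The helper names (`find_csv_files`, `table_name_for`, `infer_sql_type`, `build_create_table_sql`, `autoloader_options`), the `bronze_` prefix, and the three-type inference (BIGINT, DOUBLE, STRING) are all made up for illustration. On Databricks, Auto Loader's own schema inference would likely replace the hand-rolled part; the option names in `autoloader_options` are the ones I know from the Auto Loader docs, but please check them against your runtime.

```python
import csv
import re
from pathlib import Path


def find_csv_files(subdir):
    """Return the CSV files directly under subdir, sorted by path."""
    return sorted(
        p for p in Path(subdir).iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    )


def table_name_for(path, prefix="bronze_"):
    """Derive a table name from the file name (hypothetical naming convention)."""
    stem = re.sub(r"[^0-9a-zA-Z]+", "_", Path(path).stem).strip("_").lower()
    return f"{prefix}{stem}"


def infer_sql_type(values):
    """Pick BIGINT, DOUBLE, or STRING from sampled string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "STRING"
    for sql_type, cast in (("BIGINT", int), ("DOUBLE", float)):
        try:
            for v in non_empty:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "STRING"


def build_create_table_sql(path, sample_rows=100):
    """Build a CREATE TABLE statement from the header and a sample of rows."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)  # assumes the first row holds column names
        rows = [row for _, row in zip(range(sample_rows), reader)]
    columns = []
    for i, name in enumerate(header):
        values = [row[i] for row in rows if i < len(row)]
        columns.append(f"`{name.strip()}` {infer_sql_type(values)}")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name_for(path)} "
        f"({', '.join(columns)})"
    )


def autoloader_options(schema_location):
    """Options one might pass to spark.readStream.format("cloudFiles").

    Assumption: allowOverwrites is what you want if "full file refresh"
    means the same file names get rewritten with new content.
    """
    return {
        "cloudFiles.format": "csv",
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.allowOverwrites": "true",
        "header": "true",
    }
```

Something like `for f in find_csv_files(subdir): print(build_create_table_sql(f))` would then give one statement per file. Limiting bronze growth (truncate/merge, materialized views, partitioning) is left out here because the right choice depends on what "full file refresh" means for your files.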

BigRoux · 7 hours ago

What do you mean by "full file refreshes"? Does this refer to the fact that file names will be reused?


Help needed with automated file ingestion process for full refresh mode, preferably using Auto Loader
