I’m working on a project to practice with Databricks, the Lakehouse and Medallion architectures, and data lakes.
Setup:
I have an Azure Data Lake Storage (Gen2) account with three containers: bronze, silver, and gold.
In the bronze container, I store daily customer CSV files at paths like: bronze/adventureworks/year=2025/month=5/day=25/customer.csv
Every day, a new customer file is saved under a new path using Azure Data Factory.
In Databricks, I’m using Unity Catalog. I created a development catalog and a schema named adventureworks.
I want to create tables named bronze_customer, silver_customer, and gold_customer in their respective layers.
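For reference, here’s roughly how I set things up in a notebook (I’m calling the catalog development in these snippets, and the external location / storage credential for the containers is already configured):

```python
# One-time setup of the Unity Catalog objects for this project.
# "development" is just the catalog name I chose; adjust as needed.
spark.sql("CREATE CATALOG IF NOT EXISTS development")
spark.sql("CREATE SCHEMA IF NOT EXISTS development.adventureworks")
```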
My question: How should I create the bronze_customer table in Databricks to efficiently handle these daily files? Specifically:
How do I create the table in Unity Catalog so that it picks up all of the daily partition folders (year/month/day)?
What is the recommended approach for managing full loads (replacing all data daily) versus incremental loads (appending only new or changed data) in this setup?
I’m looking for best practices or example code for table creation and data ingestion in Databricks with Unity Catalog, given daily data stored in the data lake like this.
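To make the question concrete, these are the two alternatives I’ve been sketching for the bronze table. The storage account name is a placeholder, the _autoloader path is just a location I made up for Auto Loader’s schema and checkpoint metadata, and I’m not sure either option is the recommended pattern:

```python
# Base path of the daily CSV drops in the bronze container
# (<storage_account> is a placeholder for my real account name).
bronze_path = "abfss://bronze@<storage_account>.dfs.core.windows.net/adventureworks"

# Separate location for Auto Loader schema tracking and checkpoints,
# kept outside the data folders so it isn't picked up as input.
meta_path = "abfss://bronze@<storage_account>.dfs.core.windows.net/_autoloader/customer"

# --- Option A: full load. Re-read every daily folder and overwrite the table. ---
full_df = (
    spark.read
         .option("header", "true")
         .option("basePath", bronze_path)  # so year/month/day are inferred as partition columns
         .csv(f"{bronze_path}/year=*/month=*/day=*/customer.csv")
)

(full_df.write
        .mode("overwrite")
        .partitionBy("year", "month", "day")  # not sure if partitioning the Delta table is worth it here
        .saveAsTable("development.adventureworks.bronze_customer"))

# --- Option B: incremental load with Auto Loader, processing only new files. ---
incremental_df = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "csv")
         .option("header", "true")
         .option("cloudFiles.schemaLocation", f"{meta_path}/schema")
         .load(bronze_path)  # year=/month=/day= folders should be inferred as partition columns
)

(incremental_df.writeStream
               .option("checkpointLocation", f"{meta_path}/checkpoint")
               .trigger(availableNow=True)  # run as an incremental batch after the daily ADF copy
               .toTable("development.adventureworks.bronze_customer"))
```

Is one of these closer to the recommended approach for this layout, or is there a better pattern I should be using?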