Hi @HitMah ,
Databricks' recommendation is to use the medallion architecture. A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze → Silver → Gold layer tables).
So we have 3 layers:
- The Bronze layer is where we land all the data from external source systems. We should not apply any transformations to the source data at this layer, except adding some metadata such as load time, process ID, etc. The main purpose of this layer is to keep a historical archive of the source and to allow reprocessing if needed without rereading the data from the source system.
- In the Silver layer of the lakehouse, data from the Bronze layer is cleansed, deduplicated and merged. The Silver layer provides an "enterprise view" (e.g. master customers, stores, non-duplicated transactions and cross-reference tables).
- The Gold layer of the lakehouse is designed for business users and contains the core business logic. The final set of data transformations and data quality rules is applied here, so Kimball-style star schema data models or Inmon-style data marts typically fit in this Gold layer of the lakehouse (see the sketch after this list).
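To make the flow concrete, here is a minimal PySpark sketch of the three layers. The paths, schemas, table names and columns (orders, store_id, amount, etc.) are assumptions made up for the example, not something from a specific setup:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw source data as-is, adding only ingestion metadata
bronze_df = (
    spark.read.format("json").load("/mnt/raw/orders/")   # hypothetical landing path
    .withColumn("_load_time", F.current_timestamp())
    .withColumn("_source_file", F.input_file_name())
)
bronze_df.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: cleanse and deduplicate to build the "enterprise view"
silver_df = (
    spark.table("bronze.orders")
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: apply business logic, e.g. daily revenue per store for reporting
gold_df = (
    spark.table("silver.orders")
    .groupBy("store_id", F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"))
)
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold.daily_store_revenue")
```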
In the lakehouse data engineering paradigm, the ELT methodology is typically followed (instead of classical ETL). You can read about the difference between these two approaches in the articles below:
Extract Transform Load (ETL) | Databricks
ETL vs ELT - Difference Between Data-Processing Approaches - AWS
And when you're migrating an existing EDW to the Lakehouse, you can use various tools and accelerators to make this process easier. You can leverage:
- Lakeflow Connect, which supports ingesting data from legacy database systems as well as from modern SaaS platforms like Salesforce or Workday.
- Lakehouse Federation to easily and quickly replicate data marts from the legacy data warehouse to Databricks.
- BladeBridge to automate conversion of legacy code (e.g., Teradata BTEQ, Oracle stored procedures, third-party ETL pipelines).
- Databricks Workflows to orchestrate complex pipelines.
- Lakeflow Declarative Pipelines, which simplify batch and streaming ETL with automated reliability and built-in data quality (see the sketch after this list).
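As an illustration of that last point, here is a minimal sketch of a declarative pipeline using the Python `dlt` interface; the landing path, table names and columns are assumptions for the example:

```python
import dlt
from pyspark.sql import functions as F

# Bronze: incrementally ingest raw files with Auto Loader
@dlt.table(comment="Raw orders landed from cloud storage")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/orders/")                        # hypothetical landing path
        .withColumn("_load_time", F.current_timestamp())
    )

# Silver: a built-in data quality rule drops rows that fail the expectation
@dlt.table(comment="Cleansed, deduplicated orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read("orders_bronze").dropDuplicates(["order_id"])
```

The `@dlt.expect_or_drop` decorator is what enforces the built-in data quality rule here: rows that fail the condition are dropped and reported in the pipeline's quality metrics.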
If you have any other questions, feel free to ask!