Administration & Architecture

Transitioning Approach for Evolving EDW

HitMah
New Contributor

As the EDW will continue to evolve with new data and business logic during the multi-phased migration, what architectural strategies and design patterns can minimize rework when migrating from an evolving Enterprise Data Warehouse (EDW) to Databricks?

I've read about the ETL-First approach, but it isn't well documented and there are few detailed videos from Databricks explaining it.

3 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @HitMah ,

Databricks' recommendation is to use the medallion architecture. A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).

So we have 3 layers:

  • The Bronze layer is where we land all the data from external source systems. We should not apply any transformations to the source data at this layer, except for adding some metadata such as load time, process id, etc. The main purpose of this layer is to keep a historical archive of the source and to allow reprocessing when needed without rereading the data from the source system.
  • In the Silver layer of the lakehouse, data from the Bronze layer is cleansed, deduplicated and merged. The Silver layer provides an "enterprise view" (e.g. master customers, stores, non-duplicated transactions and cross-reference tables).
  • The Gold layer of the lakehouse is designed for business users and contains the core business logic. The final layer of data transformations and data quality rules is applied here, so this is where Kimball-style star schemas and Inmon-style data marts typically fit.
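
To make this concrete, here is a minimal PySpark sketch of the three layers as you might run it in a Databricks notebook. All table and path names are hypothetical, and it assumes the bronze/silver/gold schemas already exist; it illustrates the pattern, not a production pipeline.

```python
from pyspark.sql import functions as F

# Bronze: land the source data as-is, adding only load metadata.
(spark.read.format("json").load("/landing/orders/")
    .withColumn("_load_ts", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
    .write.mode("append").saveAsTable("bronze.orders"))

# Silver: cleanse and deduplicate into an enterprise view.
(spark.table("bronze.orders")
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
    .write.mode("overwrite").saveAsTable("silver.orders"))

# Gold: apply business logic for consumption, e.g. a daily revenue fact.
(spark.table("silver.orders")
    .groupBy("customer_id", F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"))
    .write.mode("overwrite").saveAsTable("gold.daily_revenue"))
```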

In the lakehouse data engineering paradigm, the ELT methodology is typically followed (instead of classical ETL). You can read about the difference between the two approaches in the articles below:

Extract Transform Load (ETL) | Databricks

ETL vs ELT - Difference Between Data-Processing Approaches - AWS
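
The key difference in code is the ordering: with ELT the raw data is loaded first, untouched, and the transformation runs afterwards inside Databricks itself rather than in an external ETL server. A small sketch with hypothetical names:

```python
# 1. Load: copy raw files into a table with no transformation applied.
(spark.read.format("csv").option("header", "true")
    .load("/landing/customers/")
    .write.mode("append").saveAsTable("bronze.customers_raw"))

# 2. Transform: reshape after the load, using the lakehouse engine itself.
spark.sql("""
    CREATE OR REPLACE TABLE silver.customers AS
    SELECT DISTINCT customer_id, trim(name) AS name, upper(country) AS country
    FROM bronze.customers_raw
    WHERE customer_id IS NOT NULL
""")
```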

And when you're migrating an existing EDW to the Lakehouse, you can use various tools and accelerators to make the process easier. You can leverage:

- Lakeflow Connect, which supports ingesting data from legacy database systems as well as new data coming from modern SaaS platforms like Salesforce or Workday.

- Lakehouse Federation, to easily and quickly expose data marts from the legacy data warehouse inside Databricks.

- BladeBridge, to support automatic conversion of legacy code (e.g., Teradata BTEQ, Oracle stored procedures, third-party ETL pipelines).

- Databricks Workflows, to orchestrate complex pipelines.

- Lakeflow Declarative Pipelines, which simplify batch and streaming ETL with automated reliability and built-in data quality (see the sketch below).
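
To give you a feel for that last one, here is a minimal sketch of a declarative pipeline using the Python `dlt` API. The source path, table names and the quality rule are hypothetical, and this runs as a pipeline rather than as an interactive notebook; the expectation decorator is what gives you the built-in data quality checks.

```python
import dlt

@dlt.table(comment="Raw orders landed incrementally from cloud storage.")
def bronze_orders():
    # Auto Loader picks up only new files on each pipeline update.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/landing/orders/"))

@dlt.table(comment="Cleansed orders with a data quality rule enforced.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
    return dlt.read_stream("bronze_orders").dropDuplicates(["order_id"])
```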

If you have any other questions, feel free to ask 🙂

HitMah
New Contributor

Thank you for your response.

I'm familiar with the Medallion Architecture concept in Databricks. My question is more focused on how to effectively run our existing Enterprise Data Warehouse (EDW) and Databricks in parallel during the migration phase.

For example, we currently need to develop a new data mart in the EDW, but our Databricks platform is still being built; it may take another 6+ months before it's production-ready, and potentially another year before it can fully meet the data mart requirements.

Given this overlap, what would be the best architectural approach or recommended practices to minimize rework when we eventually migrate this new data mart to Databricks?

Thanks
-werners-
Esteemed Contributor III

There is no single approach; it depends on your organization.
First you have the 'impact axis': lean and mean vs. big bang.
Next you also have bottom-up (first fix Bronze and work upwards) or top-down (focus on Gold and read data from your legacy EDW using connections).
So a lot is possible.
I am always a fan of focusing on a single stream/cube/data mart and starting by migrating that. That is more of a lean-and-mean/bottom-up approach, which I prefer, but you might think otherwise.
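
For the top-down variant, Lakehouse Federation is what lets your Gold-layer consumers query the legacy EDW in place while the lower layers are still being migrated. A sketch of what that setup can look like, assuming a SQL Server-based EDW, with all names and the secret scope hypothetical (check the docs for your EDW's connector):

```python
# Create a connection and a foreign catalog over the legacy EDW.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS legacy_edw_conn TYPE sqlserver
    OPTIONS (host 'edw.example.com', port '1433',
             user secret('edw_scope', 'user'),
             password secret('edw_scope', 'password'))
""")
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS legacy_edw
    USING CONNECTION legacy_edw_conn OPTIONS (database 'edw_prod')
""")

# Gold-layer consumers can now read legacy marts without copying data first.
df = spark.table("legacy_edw.dbo.sales_mart")
df.display()
```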

Just don't assume that your Databricks Lakehouse will not change. With planning you can limit the amount of reengineering, but you cannot avoid it. Basically it is the same as an EDW but far less strict, which has its pros and cons.
