Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best practices for a metadata-driven ETL framework

Swathik
New Contributor III

I am designing a metadata-driven ETL framework to migrate approximately 500 tables from Db2 to PostgreSQL. After reviewing multiple design patterns and blog posts, I am uncertain about the recommended approach for storing ETL metadata such as source schema, target schema, transformation logic, and test cases. Should this metadata be stored in database tables or in external configuration files such as YAML?

In addition, what is the recommended orchestration pattern for this type of migration? Is it preferable to have a dedicated workflow or pipeline per target gold table, or a more generic workflow that loops over a metadata list of 500 tables and processes them in a parameterized way?


5 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @Swathik ,

If we're talking about where to store ETL metadata, in my opinion it's mostly a matter of preference. In my case, I prefer storing my config in YAML files, but I’ve also worked on projects where the config was stored in Delta tables.

For instance, here you can find an approach that uses config tables, which is also valid:

Metadata-Driven ETL Framework in Databricks (Part-... - Databricks Community - 92666

Regarding orchestration, I prefer to automate both the bronze and silver layers. So I typically have one generic bronze-layer notebook and one generic silver-layer notebook. Then I iterate over my ETL metadata stored in the YAML config file and load each Delta table using a foreach task in Databricks Workflows.
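Just to illustrate the idea, here is a minimal sketch of what the config and the driver logic could look like. The file path, field names, secret scope, notebook split and JDBC details below are purely illustrative (not part of any standard), and the code assumes it runs in a Databricks notebook where spark and dbutils are available:

import json
import yaml

# etl_config.yaml (illustrative structure):
# tables:
#   - source_schema: SALES
#     source_table: ORDERS
#     target_table: bronze.orders
#     load_type: full
#   - source_schema: SALES
#     source_table: CUSTOMERS
#     target_table: bronze.customers
#     load_type: incremental

# Driver task: read the metadata and publish it as a task value so a
# For Each task can iterate over it, calling the generic bronze notebook
# once per table with that table's metadata as parameters.
with open("etl_config.yaml") as f:
    config = yaml.safe_load(f)

dbutils.jobs.taskValues.set(key="tables", value=json.dumps(config["tables"]))

# Generic bronze notebook (a separate notebook in practice): receives one
# table's metadata as widget/job parameters.
dbutils.widgets.text("source_schema", "")
dbutils.widgets.text("source_table", "")
dbutils.widgets.text("target_table", "")

src_schema = dbutils.widgets.get("source_schema")
src_table = dbutils.widgets.get("source_table")
tgt_table = dbutils.widgets.get("target_table")

# Pull the table from Db2 over JDBC (connection string kept in a secret
# scope, Db2 JDBC driver jar installed on the cluster) and land it as a
# bronze Delta table.
db2_url = dbutils.secrets.get(scope="etl", key="db2-jdbc-url")
df = (spark.read.format("jdbc")
      .option("url", db2_url)
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", f"{src_schema}.{src_table}")
      .load())
df.write.mode("overwrite").saveAsTable(tgt_table)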

For the gold layer I prefer to be explicit, because you usually have different business logic for your fact and dimension tables. So here I implement an independent notebook for each business entity.

Swathik
New Contributor III

OK, but is it still advisable to use a foreach loop even for 500 tables?

Also, if one of the tables fails in the loop, I think it would be difficult to have to go through all 500 tables again.

szymon_dybczak
Esteemed Contributor III

Hi @Swathik ,

If one task fails within the foreach loop, you don't have to run all of them again. Databricks Workflows supports repair runs, so only the failed iterations will be rerun.

Also, since we are talking about a metadata-driven framework, you can add the ability to disable tables using a boolean flag in the control table (or in the YAML config file).
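Something as simple as this would do. It's only a sketch, and the flag, file and table names are just examples:

import json
import yaml

# Driver side: keep only the tables that are switched on; disabled tables
# are simply skipped without touching the rest of the framework.
with open("etl_config.yaml") as f:
    config = yaml.safe_load(f)

active = [t for t in config["tables"] if t.get("enabled", True)]
dbutils.jobs.taskValues.set(key="tables", value=json.dumps(active))

# Control-table variant of the same idea, using a boolean column:
active = [row.asDict()
          for row in spark.table("etl_control.table_config")
                          .where("enabled = true")
                          .collect()]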

 

nayan_wylde
Esteemed Contributor

For a migration of that scale, I’d lean toward storing metadata in database tables rather than YAML files. It’s easier to query, update, and integrate with orchestration tools, especially when you have 500 tables. YAML works fine for small projects, but it gets messy fast at enterprise scale.
For orchestration, a generic parameterized workflow is usually the way to go. Instead of creating 500 separate pipelines, have one dynamic pipeline that loops through your metadata and processes each table. You can still parallelize tasks for performance, but this approach keeps things maintainable and flexible.
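For illustration only, a control table along these lines (all schema, table and notebook names here are made up) keeps the metadata queryable and lets one parameterized job drive the whole migration:

# Control table: one row per table to migrate.
spark.sql("""
    CREATE TABLE IF NOT EXISTS etl_control.table_config (
        source_schema   STRING,
        source_table    STRING,
        target_table    STRING,
        load_type       STRING,     -- e.g. 'full' or 'incremental'
        transformation  STRING,     -- optional expression applied downstream
        enabled         BOOLEAN,
        last_loaded_at  TIMESTAMP
    )
""")

# Driver: read the active rows and run the generic load once per table.
# In a real job this loop would be a For Each task (or notebook runs with a
# concurrency limit) rather than a plain sequential loop.
rows = spark.table("etl_control.table_config").where("enabled = true").collect()
for meta in rows:
    dbutils.notebook.run(
        "generic_bronze_load",   # hypothetical generic notebook
        3600,                    # timeout in seconds
        {
            "source_schema": meta["source_schema"],
            "source_table": meta["source_table"],
            "target_table": meta["target_table"],
            "load_type": meta["load_type"],
        },
    )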

Swathik
New Contributor III

Thank you. This helps a lot.