Sunday
I am designing a metadata-driven ETL framework to migrate approximately 500 tables from Db2 to PostgreSQL. After reviewing multiple design patterns and blog posts, I am uncertain about the recommended approach for storing ETL metadata such as source schema, target schema, transformation logic, and test cases. Should this metadata be stored in database tables or in external configuration files such as YAML?
In addition, what is the recommended orchestration pattern for this type of migration? Is it preferable to have a dedicated workflow or pipeline per target gold table, or a more generic workflow that loops over a metadata list of 500 tables and processes them in a parameterized way?
Sunday
Hi @Swathik ,
If we're talking about where to store ETL metadata, in my opinion it's mostly a matter of preference. In my case, I prefer storing my config in YAML files, but I've also worked on projects where the config was stored in Delta tables.
For instance, here you can find an approach that uses config tables, which is also valid:
Metadata-Driven ETL Framework in Databricks (Part-... - Databricks Community - 92666
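To make this concrete, here's a minimal sketch of what one YAML entry per table could look like, loaded in Python with PyYAML. The field names (source_schema, target_table, load_type, is_enabled, and so on) are just illustrative assumptions, not a fixed standard:
```python
# Minimal sketch: one YAML entry per table, parsed with PyYAML.
# All field names here are illustrative assumptions.
import yaml

config_yaml = """
tables:
  - source_schema: DB2INST1          # Db2 source schema (example name)
    source_table: CUSTOMER
    target_schema: bronze
    target_table: customer
    load_type: full                  # or "incremental"
    is_enabled: true
  - source_schema: DB2INST1
    source_table: ORDERS
    target_schema: bronze
    target_table: orders
    load_type: incremental
    watermark_column: ORDER_TS
    is_enabled: true
"""

config = yaml.safe_load(config_yaml)
for entry in config["tables"]:
    print(entry["source_table"], "->", entry["target_schema"] + "." + entry["target_table"])
```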
Regarding orchestration, I prefer to automate both the bronze and silver layers. So I typically have one generic bronze-layer notebook and one generic silver-layer notebook. Then I iterate over my ETL metadata stored in the YAML config file and load each Delta table using a foreach task in Databricks Workflows.
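A rough sketch of such a generic bronze notebook is below. The widget names, JDBC URL, and secret scope/keys are placeholders you'd adapt to your environment, and it assumes a Databricks notebook where spark and dbutils are available:
```python
# Sketch of a generic, parameterized bronze notebook driven by a foreach task.
# Widget names, JDBC URL, and secret scope/key are assumptions.
source_table = dbutils.widgets.get("source_table")   # e.g. "DB2INST1.CUSTOMER"
target_table = dbutils.widgets.get("target_table")   # e.g. "bronze.customer"

jdbc_url = "jdbc:db2://db2-host:50000/MYDB"           # placeholder host/port/database
db2_user = dbutils.secrets.get(scope="etl", key="db2_user")
db2_password = dbutils.secrets.get(scope="etl", key="db2_password")

# Read the source table from Db2 over JDBC.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", source_table)
    .option("user", db2_user)
    .option("password", db2_password)
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .load()
)

# Land the data as a Delta table in the bronze layer.
df.write.format("delta").mode("overwrite").saveAsTable(target_table)
```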
For the gold layer I prefer to be explicit, because you usually have different business logic for your fact and dimension tables. So there I implement an independent notebook for each business entity.
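For example, a gold notebook for one dimension could encode its business logic explicitly with a MERGE, roughly like this (table and column names are purely illustrative):
```python
# Sketch of an explicit gold-layer notebook for one business entity.
# Table and column names are illustrative assumptions.
spark.sql("""
    MERGE INTO gold.dim_customer AS tgt
    USING (
        SELECT
            customer_id,
            trim(customer_name)   AS customer_name,
            upper(country_code)   AS country_code,
            current_timestamp()   AS updated_at
        FROM silver.customer
    ) AS src
    ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```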
yesterday
OK, but is it still recommended to use a foreach loop even for 500 tables?
Also, if one of the tables fails in the loop, I think it would be difficult to rerun all 500 tables.
yesterday
Hi @Swathik ,
If one task fails within the foreach loop, you don't have to run all of them again. Databricks Workflows supports repair runs, and only the failed iterations will rerun.
Also, since we are talking about a metadata-driven framework, you can add the ability to disable tables using a boolean flag in the control table (or in the YAML config file).
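A minimal sketch of that idea, assuming a hypothetical control table called etl_control with an is_enabled flag: a small driver task filters the enabled entries and passes the list to the foreach task via task values.
```python
# Sketch: build the foreach input from the control table, skipping disabled entries.
# The table name "etl_control" and its column names are assumptions.
enabled_tables = [
    row.asDict()
    for row in spark.table("etl_control")
                    .where("is_enabled = true")
                    .select("source_table", "target_table", "load_type")
                    .collect()
]

# Expose the list to the downstream foreach task in Databricks Workflows.
dbutils.jobs.taskValues.set(key="tables_to_load", value=enabled_tables)
```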
yesterday
For a migration of that scale, I'd lean toward storing metadata in database tables rather than YAML files. It's easier to query, update, and integrate with orchestration tools, especially when you have 500 tables. YAML works fine for small projects, but it gets messy fast at enterprise scale.
For orchestration, a generic parameterized workflow is usually the way to go. Instead of creating 500 separate pipelines, have one dynamic pipeline that loops through your metadata and processes each table. You can still parallelize tasks for performance, but this approach keeps things maintainable and flexible.
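If you go the control-table route, the metadata could live in a Delta table along these lines (the columns are just an illustration of the kind of attributes you might track, not a prescribed schema):
```python
# Sketch of a Delta control table for the migration metadata.
# Column names and types are illustrative assumptions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS etl_control (
        source_schema     STRING,
        source_table      STRING,
        target_schema     STRING,
        target_table      STRING,
        load_type         STRING,      -- 'full' or 'incremental'
        watermark_column  STRING,      -- used only for incremental loads
        transformation    STRING,      -- e.g. reference to a SQL snippet or rule set
        is_enabled        BOOLEAN,
        last_loaded_at    TIMESTAMP
    ) USING DELTA
""")

# The generic pipeline then simply iterates over this table.
for row in spark.table("etl_control").where("is_enabled = true").collect():
    print(f"Would load {row.source_schema}.{row.source_table} -> {row.target_schema}.{row.target_table}")
```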
yesterday
Thank you. This helps a lot.