07-29-2024 03:09 AM
My task is to sync 300 tables from an on-prem SQL Server to Delta Lake.
I load CDC data from the raw zone. The first step is to move the CDC data to bronze with Auto Loader. Then, using a Delta stream, I read the changes from bronze, apply simple datatype changes, and merge the dataset into silver.
Previously I had just 20 tables, so I made 20 notebooks. But now, with 300 tables, that seems like far too many notebooks.
I was thinking of making one notebook that loops through all the tables and skips the datatype changes. This approach would not run in parallel in a Databricks job, right?
Is there any way to generate notebooks based on some metadata (table name, schema)?
Is it a good idea to run a job with 300 notebooks? The number of records from CDC will not be in the millions, but rather in the thousands.
Thanks.
07-29-2024 03:19 AM - edited 07-29-2024 03:20 AM
Hi @alesventus ,
You can apply a metadata/config-driven approach. Create a control table (or a JSON/YAML file) with all the information required for processing, such as:
- table name
- target table
- table primary keys
- transformation to apply
Then loop over each table and pass the parameters to a generic notebook.
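To make the idea concrete, here is a minimal sketch of that pattern. The config entries, the notebook path, and the parameter names are all hypothetical; in practice the config could live in a Delta table, a JSON file, or a YAML file, and the loop would run inside a Databricks notebook where `dbutils` is available.

```python
# Hypothetical control-table entries (in practice: a Delta table or JSON/YAML file).
TABLE_CONFIG = [
    {"table_name": "dbo.customers", "target_table": "silver.customers",
     "primary_keys": ["customer_id"], "transformations": ["trim_strings"]},
    {"table_name": "dbo.orders", "target_table": "silver.orders",
     "primary_keys": ["order_id", "line_id"], "transformations": []},
]

def build_notebook_params(cfg):
    """Flatten one config entry into the parameters a generic notebook
    would accept (notebook widget values must be strings)."""
    return {
        "table_name": cfg["table_name"],
        "target_table": cfg["target_table"],
        "primary_keys": ",".join(cfg["primary_keys"]),
        "transformations": ",".join(cfg["transformations"]),
    }

# Inside Databricks you would then loop and invoke the generic notebook:
# for cfg in TABLE_CONFIG:
#     dbutils.notebook.run("/Repos/etl/generic_cdc_merge",  # hypothetical path
#                          timeout_seconds=3600,
#                          arguments=build_notebook_params(cfg))
```

The generic notebook reads these parameters back via widgets and performs the bronze-to-silver merge for that one table.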
08-06-2024 04:51 AM
I followed your advice, and it worked pretty well. I have a generic notebook that accepts the table name and primary keys as parameters.
With the Databricks CLI command `databricks jobs create --json {...}` I can create a job with hundreds of tasks in one shot. The skeleton of the job is always very similar, and I generate the task definitions in Excel.
Thanks for the idea.
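Instead of Excel, the job spec can also be generated with a small script. The sketch below builds a multi-task job JSON in the shape accepted by `databricks jobs create --json`; the notebook path, cluster key, and table list are placeholders, not values from the thread.

```python
import json

def make_task(table_name, primary_keys):
    """One job task per table, all pointing at the same generic notebook.
    Notebook path and job_cluster_key are hypothetical placeholders."""
    return {
        "task_key": "sync_" + table_name.replace(".", "_"),
        "notebook_task": {
            "notebook_path": "/Repos/etl/generic_cdc_merge",
            "base_parameters": {
                "table_name": table_name,
                "primary_keys": ",".join(primary_keys),
            },
        },
        "job_cluster_key": "shared_cluster",
    }

# Placeholder table list; with 300 tables this would come from the control table.
tables = [("dbo.customers", ["customer_id"]),
          ("dbo.orders", ["order_id", "line_id"])]

job_spec = {
    "name": "cdc_sync_all_tables",
    "tasks": [make_task(name, pks) for name, pks in tables],
}

# Serialize and save, then run: databricks jobs create --json @job_spec.json
spec_json = json.dumps(job_spec, indent=2)
```

Because the tasks have no `depends_on` entries, they are independent and the job scheduler can run them concurrently, which also addresses the original parallelism concern.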
01-29-2026 09:52 PM
Hi,
Can you please share a step-by-step approach for the above, with a sample use case and code that can be implemented in all the layers (bronze/silver/gold)?
01-29-2026 10:44 PM
Hi Alesventus,
The approach suggested by syzmon_dybczak is definitely the direction I would go in. To supplement their answer, have a look at this Databricks Labs repository, which you may find useful as inspiration (DLT-Meta): https://github.com/databrickslabs/dlt-meta.
I have referenced it multiple times in the past, and while this repository focuses specifically on Lakeflow Declarative Pipelines, the patterns it presents are tool-agnostic. You may find reusable ideas in its suggested control table structure and in how that structure maps onto your specific use case.