cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

How to handle load of 300 tables to delta lake

alesventus
Contributor

My task is to sync 300 tables from on prem sql server to delta lake. 

I will load CDC from Raw. First step is to move CDC data to bronze with autoloader. Then using delta stream get changes from bronze, make simple datatype changes and merge this dataset to silver.

Previously I had just 20 tables, so I made 20 notebooks. But now with 300 tables seems to me little bit too many notebooks. 

I was thinking to make one notebook with loop through all tables and skip datatype changes. This approach would not be running in parallel in databricks job, right?

Is there any way to generate notebooks based on some metadata (table name, schema)?

Is it a good idea to run job where is 300 notebooks? Number of records from CDC will not be in millions, but let's say in thousands.

 

Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions

szymon_dybczak
Esteemed Contributor III

Hi @alesventus ,

You can apply metadata/config driven approch. You can create control table (or json/yaml file) with all information that are required for processing like:

- table name

- target table

- table primary keys

- transformation to apply

And then loop on each table and pass parameters to generic notebook.

View solution in original post

2 REPLIES 2

szymon_dybczak
Esteemed Contributor III

Hi @alesventus ,

You can apply metadata/config driven approch. You can create control table (or json/yaml file) with all information that are required for processing like:

- table name

- target table

- table primary keys

- transformation to apply

And then loop on each table and pass parameters to generic notebook.

Hi @szymon_dybczak 

I followed your advice, and it worked pretty well. I have generic notebook which accepts parameters as table name and primary keys.

In databricks with CLI command  databricks jobs create --json {...} I can create job with hundreds of tasks in one shot. Skeleton of the jobs is very similar and tasks I generate in excel. 

Thanks for idea.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group