Data Engineering

How to handle 100+ tables ETL through spark structured streaming?

Zair
New Contributor II

I am writing a streaming job which will perform ETL for more than 130 tables. I would like to know whether there is a better way to do this. The other solution I am considering is to write a separate streaming job for each table.

The source data comes from CDC through Event Hubs in real time.
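For context, a common pattern for this shape of problem is a single stream read from Event Hubs that is fanned out to one Delta table per source table inside a `foreachBatch` callback. The per-batch routing step can be sketched in plain Python (the `table` field on each event is an assumption about the CDC payload; Debezium-style payloads, for example, carry the source table name):

```python
# Hypothetical sketch: route one micro-batch of CDC events to per-table sinks.
# In Spark Structured Streaming this logic would live inside a foreachBatch
# callback; here it is shown over already-decoded event dicts.
from collections import defaultdict

def route_by_table(events):
    """Group CDC events by the table they belong to.

    `events` is an iterable of dicts with a 'table' key (an assumed field
    in the CDC payload). Returns {table_name: [events...]}; each group
    would then be appended or merged into its own Delta table.
    """
    grouped = defaultdict(list)
    for event in events:
        grouped[event["table"]].append(event)
    return dict(grouped)

# Example micro-batch: events for two of the 130 tables.
batch = [
    {"table": "orders", "op": "insert", "id": 1},
    {"table": "customers", "op": "update", "id": 7},
    {"table": "orders", "op": "delete", "id": 2},
]
routed = route_by_table(batch)
```

The trade-off is that all tables then share one checkpoint and one failure domain, which is part of what the replies below discuss.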

2 REPLIES

artsheiko
Valued Contributor III

Hi, to answer your question it might be helpful to get more details on what you're trying to achieve and the bottleneck you're encountering now.

Indeed, handling the processing of 130 tables in one monolith could be challenging: business rules might change in the future, and one day the required processing frequency may also diverge (for example, you may find that some information can be processed in batch mode).

It is also useful to consider this problem from the team's point of view: if everything is processed within the same streaming job, you will most likely not be able to distribute development and support work for this processing among several team members in the future.
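One way to keep that flexibility without hand-writing 130 jobs is a single parameterized job template driven by per-table configuration. A minimal sketch, where `checkpoint_root` and `target_root` are illustrative names (each stream needs its own unique checkpoint location):

```python
# Hypothetical sketch: derive one independent stream definition per table
# from a shared template, so a table can later be moved to its own job
# (or to batch mode) by editing config rather than code.
def build_stream_configs(tables, checkpoint_root, target_root):
    """Return one config dict per table; each could drive its own
    writeStream call with a unique checkpointLocation."""
    return [
        {
            "table": t,
            "checkpoint": f"{checkpoint_root}/{t}",
            "target": f"{target_root}/{t}",
            "trigger_seconds": 60,  # illustrative default; tune per table
        }
        for t in tables
    ]

configs = build_stream_configs(["orders", "customers"], "/chk", "/delta")
```

With this split, a subset of tables can be handed to another team member, or rescheduled, without touching the others.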

Zair
New Contributor II

Hi @Artem Sheiko,

Thank you for your detailed reply. I understand what you are referring to, but there is no requirement to process the data in batches. The target is just a replica of the original transactional database, so we need to copy the data to Delta Lake without changing anything in terms of transformation.

Considering our small team, do we really need to split the data stream into many small single-table streams? How would that impact system performance? As I understand it, with streaming you start big and then gradually break it down into smaller streams if required.

If you can refer me to some documentation, that would be really helpful.

Thanks