Data Engineering

How to handle 100+ tables ETL through spark structured streaming?

Zair
New Contributor II

I am writing a streaming job which will perform ETL for more than 130 tables. I would like to know whether there is a better way to do this. The other solution I am considering is to write a separate streaming job for each table.

The source data comes from CDC through Event Hubs in real time.
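For context, a common pattern for this shape of problem is a single stream read from Event Hubs that is fanned out to one Delta table per source table inside a `foreachBatch` callback. The per-batch routing step can be sketched in plain Python (the `table` field on each event is an assumption about the CDC payload; Debezium-style payloads, for example, carry the source table name):

```python
# Hypothetical sketch: route one micro-batch of CDC events to per-table sinks.
# In Spark Structured Streaming this logic would live inside a foreachBatch
# callback; here it is shown over already-decoded event dicts.
from collections import defaultdict

def route_by_table(events):
    """Group CDC events by the table they belong to.

    `events` is an iterable of dicts with a 'table' key (an assumed field
    in the CDC payload). Returns {table_name: [events...]}; each group
    would then be appended or merged into its own Delta table.
    """
    grouped = defaultdict(list)
    for event in events:
        grouped[event["table"]].append(event)
    return dict(grouped)

# Example micro-batch: events for two of the 130 tables.
batch = [
    {"table": "orders", "op": "insert", "id": 1},
    {"table": "customers", "op": "update", "id": 7},
    {"table": "orders", "op": "delete", "id": 2},
]
routed = route_by_table(batch)
```

The trade-off is that all tables then share one checkpoint and one failure domain, which is part of what the replies below discuss.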

2 REPLIES

artsheiko
Valued Contributor III

Hi, to answer your question it might be helpful to get more details on what you're trying to achieve and the bottleneck you're encountering now.

Indeed, handling the processing of 130 tables in one monolith could be challenging: business rules might change in the future, and one day the required processing frequency may also diverge (for example, you may find that some information can be processed in batch mode).

It is also useful to consider this problem from the team's point of view: if everything is processed within the same streaming job, you will most likely not be able to distribute development and support work for this processing among several team members in the future.
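One way to keep that flexibility without hand-writing 130 jobs is a single parameterized job template driven by per-table configuration. A minimal sketch, where `checkpoint_root` and `target_root` are illustrative names (each stream needs its own unique checkpoint location):

```python
# Hypothetical sketch: derive one independent stream definition per table
# from a shared template, so a table can later be moved to its own job
# (or to batch mode) by editing config rather than code.
def build_stream_configs(tables, checkpoint_root, target_root):
    """Return one config dict per table; each could drive its own
    writeStream call with a unique checkpointLocation."""
    return [
        {
            "table": t,
            "checkpoint": f"{checkpoint_root}/{t}",
            "target": f"{target_root}/{t}",
            "trigger_seconds": 60,  # illustrative default; tune per table
        }
        for t in tables
    ]

configs = build_stream_configs(["orders", "customers"], "/chk", "/delta")
```

With this split, a subset of tables can be handed to another team member, or rescheduled, without touching the others.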

Zair
New Contributor II

Hi @Artem Sheiko,

Thank you for your detailed reply. I understand what you are referring to, but there is no requirement to process the data in batches. The target is just a replica of the original transactional database, so we need to copy the data to Delta Lake without changing anything in terms of transformation.

Considering our small team, do we really need to split the data stream into many small single-table streams? How would that impact system performance? As I understand it, with streaming you start big and then gradually break it down into smaller streams if required.

If you can refer me to some documentation, that would be really helpful.

Thanks