Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Live Tables for a large number of tables

priyanananthram
New Contributor II

Hi There 

I am hoping for some guidance. I have some 850 tables that I need to ingest using a DLT pipeline. When I do this, my event log shows that the driver node becomes unresponsive, likely due to GC.

Can DLT be used to ingest a large number of tables?

Is there some way for me to batch these tables so that I can create DLT tables 50 or so at a time? My tables will be streaming tables, and the plan is for them to run continuously.

What can I do to ameliorate this?

I am on the Azure cloud. Is there a particular compute type that would be beneficial for reading a larger number of tables?

Kind Regards

Priya

4 REPLIES

Faisal
Contributor

This can be controlled at the workflow level; my suggestion would be to batch the tables by schema.

priyanananthram
New Contributor II

The only issue, though, is that the tables are largely from one schema 🙂 I wonder if there is an upper limit on the number of tables in a DLT pipeline?

Sidhant07
Databricks Employee

Delta Live Tables (DLT) can indeed be used to ingest a large number of tables. However, if you're experiencing issues with the driver node becoming unresponsive due to garbage collection (GC), it might be a sign that the resources allocated to the driver are insufficient.

To manage the ingestion of a large number of tables, you can consider batching the tables. You can create multiple DLT pipelines, each handling a subset of the tables. This way, you can distribute the load across multiple pipelines, reducing the pressure on a single pipeline and potentially mitigating the GC issue.

In terms of compute type on Azure, you might want to consider using larger VM sizes for your Databricks clusters, especially for the driver node, to handle the load of reading a large number of tables. The choice of VM size would depend on the size and complexity of your tables.

Also, consider tuning the Spark configurations related to memory management and GC. For instance, you can adjust the Spark driver memory, the fraction of memory dedicated to Spark's storage and execution, and the GC settings.
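To make the batching concrete, here is a minimal sketch of that approach using DLT's metaprogramming pattern, where one parameterised notebook defines only a slice of the full table list and each pipeline instance is given its own slice. The `batch_index` setting, the table names, and the Auto Loader path are illustrative assumptions, not from this thread:

```python
import dlt

# Full source table list; in practice this would come from a config
# table or file rather than being hard-coded. Names are illustrative.
ALL_TABLES = ["customers", "orders", "payments"]  # ... up to 850 names

BATCH_SIZE = 50

# Hypothetical pipeline setting: give each pipeline its own slice of the
# list by setting "batch_index" (0, 1, 2, ...) in the pipeline configuration.
batch_index = int(spark.conf.get("batch_index", "0"))
batch = ALL_TABLES[batch_index * BATCH_SIZE:(batch_index + 1) * BATCH_SIZE]

def make_table(table_name):
    # Define one streaming table; the function wrapper keeps the loop
    # variable correctly bound for each table definition.
    @dlt.table(name=table_name)
    def _t():
        return (
            spark.readStream.format("cloudFiles")    # Auto Loader
            .option("cloudFiles.format", "parquet")  # assumed source format
            .load(f"abfss://landing@<account>.dfs.core.windows.net/{table_name}")
        )

for t in batch:
    make_table(t)
```

With 850 tables and a batch size of 50, that would give you 17 pipelines, each with its own driver, rather than one driver tracking every flow.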

 

@Sidhant07 this is useful information; however, we are having a similar issue in our pipeline. The pipeline has multiple sub-pipelines. We have about 26 streaming tables as part of the pipeline, and it is hogging the CPU of the job compute cluster, on both the driver and worker nodes.

There are 5 worker nodes, each of type Standard_D4ads_v5 (4 cores, 16 GB memory), and a driver node of type Standard_D8ads_v5 (8 cores, 32 GB memory). All 5 workers are running, and in the cluster metrics they show dark orange or red, which means they are running very hot on CPU. The driver is worse, running at over 95% CPU.

How can we troubleshoot which part of the pipeline is hogging the CPU? For example, which sub-pipeline is the cause of the issue, or which part of a sub-pipeline is causing the CPU hogging, and how do we narrow down the issue?
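One starting point is the pipeline's event log, which records a flow_progress event per flow per micro-batch, so you can see which flows are doing the most work. A rough sketch of querying it from a notebook, assuming the event_log() table-valued function is available in your workspace; "<pipeline-id>" is a placeholder for your own pipeline's ID:

```python
# Rough sketch: find the busiest flows via the pipeline event log.
# "<pipeline-id>" is a placeholder for your pipeline's ID.
flow_metrics = spark.sql("""
    SELECT
        origin.flow_name,
        timestamp,
        details:flow_progress.metrics.num_output_rows AS num_output_rows
    FROM event_log("<pipeline-id>")
    WHERE event_type = 'flow_progress'
    ORDER BY timestamp DESC
""")
display(flow_metrics)
```

The event log does not report CPU directly, but per-flow row counts and event frequency usually point at the hottest flows; combining this with the cluster's live metrics UI should help narrow down which sub-pipeline is responsible.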
