<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Delta live tables for large number of tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/48983#M28443</link>
    <description>&lt;P&gt;The only issue, though, is that the tables are largely from one schema &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; I wonder if there is an upper limit on the number of tables in a DLT pipeline?&lt;/P&gt;</description>
    <pubDate>Wed, 11 Oct 2023 21:46:04 GMT</pubDate>
    <dc:creator>priyanananthram</dc:creator>
    <dc:date>2023-10-11T21:46:04Z</dc:date>
    <item>
      <title>Delta live tables for large number of tables</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/48918#M28422</link>
      <description>&lt;P&gt;Hi There&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am hoping for some guidance. I have some 850 tables that I need to ingest using a DLT pipeline. When I do this, my event log shows that the driver node becomes unresponsive, likely due to GC.&lt;/P&gt;&lt;P&gt;Can DLT be used to ingest a large number of tables?&lt;/P&gt;&lt;P&gt;Is there some way for me to batch these tables so that I can create DLT tables 50 or so at a time? My tables will be streaming tables, and the plan is for them to run continuously.&lt;/P&gt;&lt;P&gt;What can I do to mitigate this?&lt;/P&gt;&lt;P&gt;I am on Azure; is there a particular compute type that would be beneficial for reading a larger number of tables?&lt;/P&gt;&lt;P&gt;Kind Regards&lt;/P&gt;&lt;P&gt;Priya&lt;/P&gt;</description>
      <pubDate>Wed, 11 Oct 2023 06:20:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/48918#M28422</guid>
      <dc:creator>priyanananthram</dc:creator>
      <dc:date>2023-10-11T06:20:10Z</dc:date>
    </item>
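    <!-- A minimal sketch of the batching asked about above: splitting the full table list into fixed-size groups, each of which could drive its own pipeline. All names (table_names, chunked) are illustrative placeholders, not part of any Databricks API. -->
    <!--
    ```python
    # Hypothetical sketch: split a large table list into fixed-size batches,
    # so each batch can be handled by its own DLT pipeline.

    def chunked(items, size):
        """Yield consecutive batches of at most `size` items."""
        for start in range(0, len(items), size):
            yield items[start:start + size]

    # e.g. 850 source tables, batched 50 at a time
    table_names = [f"source_table_{i:03d}" for i in range(850)]
    batches = list(chunked(table_names, 50))
    ```
    Each batch could then be passed to one pipeline's source notebook, e.g. a loop that registers one `@dlt.table`-decorated factory function per name (hedged: the `dlt` module is only importable inside a running pipeline, so it is not shown here).
    -->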
    <item>
      <title>Re: Delta live tables for large number of tables</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/48941#M28432</link>
      <description>&lt;P&gt;This can be controlled at the workflow level; my suggestion would be to batch it by schema.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Oct 2023 11:27:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/48941#M28432</guid>
      <dc:creator>Faisal</dc:creator>
      <dc:date>2023-10-11T11:27:42Z</dc:date>
    </item>
    <item>
      <title>Re: Delta live tables for large number of tables</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/48983#M28443</link>
      <description>&lt;P&gt;The only issue, though, is that the tables are largely from one schema &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; I wonder if there is an upper limit on the number of tables in a DLT pipeline?&lt;/P&gt;</description>
      <pubDate>Wed, 11 Oct 2023 21:46:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/48983#M28443</guid>
      <dc:creator>priyanananthram</dc:creator>
      <dc:date>2023-10-11T21:46:04Z</dc:date>
    </item>
    <item>
      <title>Re: Delta live tables for large number of tables</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/50344#M28773</link>
      <description>&lt;P&gt;Delta Live Tables (DLT) can indeed be used to ingest a large number of tables. However, if you're experiencing issues with the driver node becoming unresponsive due to garbage collection (GC), it might be a sign that the resources allocated to the driver are insufficient.&lt;/P&gt;&lt;P&gt;To manage the ingestion of a large number of tables, you can consider batching the tables. You can create multiple DLT pipelines, each handling a subset of the tables. This way, you can distribute the load across multiple pipelines, reducing the pressure on a single pipeline and potentially mitigating the GC issue.&lt;/P&gt;&lt;P&gt;In terms of compute type on Azure, you might want to consider using larger VM sizes for your Databricks clusters, especially for the driver node, to handle the load of reading a large number of tables. The choice of VM size would depend on the size and complexity of your tables.&lt;/P&gt;&lt;P&gt;Also, consider tuning the Spark configurations related to memory management and GC. For instance, you can adjust the Spark driver memory, the fraction of memory dedicated to Spark's storage and execution, and the GC settings.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Nov 2023 10:49:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/50344#M28773</guid>
      <dc:creator>Sidhant07</dc:creator>
      <dc:date>2023-11-02T10:49:37Z</dc:date>
    </item>
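    <!-- The tuning knobs this reply names (driver memory, memory fraction, GC settings) map to standard Spark configuration keys. A minimal sketch follows; the values are illustrative assumptions to be sized against your workload, not recommendations. -->
    <!--
    ```python
    # Hedged sketch of cluster-level Spark settings for a pipeline tracking
    # many tables. Values here are placeholders, not tuned recommendations.
    spark_conf = {
        # More driver heap for pipelines that coordinate hundreds of tables
        "spark.driver.memory": "32g",
        # Fraction of heap shared by Spark execution and storage (default 0.6)
        "spark.memory.fraction": "0.6",
        # G1GC often behaves better than the default collector on large heaps
        "spark.driver.extraJavaOptions": "-XX:+UseG1GC",
    }
    ```
    In Databricks these would typically be set in the pipeline's cluster configuration (the `spark_conf` block) rather than at runtime, since driver memory cannot be changed after the JVM starts.
    -->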
    <item>
      <title>Re: Delta live tables for large number of tables</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/91476#M38170</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/36707"&gt;@Sidhant07&lt;/a&gt;&amp;nbsp;this is useful information; however, we are having a similar issue in our pipeline. The pipeline has multiple sub-pipelines. We have about 26 streaming tables as part of the pipeline, and this pipeline is hogging the CPU of the job compute cluster, on both driver and worker nodes.&amp;nbsp;&lt;/P&gt;&lt;P&gt;There are 5 worker nodes, each of type Standard_D4ads_v5 (4 cores, 16GB memory), and a driver node of type Standard_D8ads_v5 (8 cores, 32GB memory). All 5 workers are running, and they are either dark orange or have turned red, which means they are running very hot on CPU. The driver is worse, running at over 95% CPU.&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can we troubleshoot which part of the pipeline is hogging the CPU? E.g., which sub-pipeline is the cause of the issue, or which part of a sub-pipeline is causing the CPU hogging, and how do we narrow down the issue?&lt;/P&gt;</description>
      <pubDate>Mon, 23 Sep 2024 17:11:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-for-large-number-of-tables/m-p/91476#M38170</guid>
      <dc:creator>PushkarDeole</dc:creator>
      <dc:date>2024-09-23T17:11:23Z</dc:date>
    </item>
  </channel>
</rss>

