10-10-2023 11:18 PM - edited 10-10-2023 11:20 PM
Hi There
I am hoping for some guidance. I have some 850 tables that I need to ingest using a DLT pipeline. When I do this, my event log shows that the driver node becomes unresponsive, likely due to GC.
Can DLT be used to ingest a large number of tables?
Is there some way for me to batch these tables so that I can create DLT tables 50 or so at a time? My tables will be streaming tables, and the plan is for them to run continuously.
What can I do to ameliorate these issues?
I am on the Azure cloud. Is there a particular compute type that would be beneficial for reading a larger number of tables?
Kind Regards
Priya
10-11-2023 04:25 AM - edited 10-11-2023 04:27 AM
This can be controlled at the workflow level. My suggestion would be to batch the tables by schema.
10-11-2023 02:46 PM
The only issue, though, is that the tables are largely from one schema 🙂 I wonder if there is an upper limit on the number of tables in a DLT pipeline?
11-02-2023 03:49 AM
Delta Live Tables (DLT) can indeed be used to ingest a large number of tables. However, if you're experiencing issues with the driver node becoming unresponsive due to garbage collection (GC), it might be a sign that the resources allocated to the driver are insufficient.

To manage the ingestion of a large number of tables, you can consider batching the tables. You can create multiple DLT pipelines, each handling a subset of the tables. This way, you distribute the load across multiple pipelines, reducing the pressure on any single pipeline and potentially mitigating the GC issue.

In terms of compute type on Azure, you might want to consider using larger VM sizes for your Databricks clusters, especially for the driver node, to handle the load of reading a large number of tables. The choice of VM size would depend on the size and complexity of your tables.

Also, consider tuning the Spark configurations related to memory management and GC. For instance, you can adjust the Spark driver memory, the fraction of memory dedicated to Spark's storage and execution, and the GC settings.
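To make the batching concrete, one common pattern is to declare the streaming tables programmatically from a list, so that each pipeline only owns a subset (say, 50) of the 850 tables. This is only a minimal sketch: it assumes the raw data for each table lands under a per-table path readable with Auto Loader, and the table names, path, and format are placeholders you would replace with your own.

```python
import dlt

# Hypothetical subset of the ~850 source tables that this one pipeline should own.
# Each additional pipeline gets its own batch of table names.
TABLE_BATCH = ["orders", "customers", "invoices"]  # ... up to ~50 per pipeline

def create_streaming_table(table_name: str):
    # Define the flow inside a factory function so table_name is captured per table.
    @dlt.table(name=f"bronze_{table_name}")
    def ingest():
        # Assumes raw files land under a per-table folder and can be read with
        # Auto Loader (cloudFiles); swap in whatever reader matches your sources.
        return (
            spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load(f"/mnt/raw/{table_name}")
        )

for name in TABLE_BATCH:
    create_streaming_table(name)
```

Splitting the table list across several pipelines in this way keeps the number of flows each driver has to plan and monitor smaller, which is usually what relieves the pressure behind the GC pauses.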
09-23-2024 10:11 AM
@Sidhant07 This is useful information; however, we are having a similar issue in our pipeline. The pipeline has multiple sub-pipelines. We have about 26 streaming tables as part of the pipeline, and it is hogging the CPU of the job compute cluster, on both the driver and worker nodes.
There are 5 worker nodes, each of type Standard_D4ads_v5 (4 cores, 16 GB memory), and a driver node of type Standard_D8ads_v5 (8 cores, 32 GB memory). All 5 workers are running, and in the cluster metrics they show dark orange or red, which means they are running very hot on CPU. The driver is worse, running at over 95% CPU.
How can we troubleshoot which part of the pipeline is hogging the CPU? For example, which sub-pipeline is the cause of the issue, or which part of a sub-pipeline is causing the CPU hogging, and how can we narrow this down?
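One possible way to start narrowing it down (a sketch, assuming the pipeline's event log can be queried with the event_log() table-valued function and that flow_progress events carry per-flow metrics) is to aggregate the event log by flow name and see which tables dominate. The pipeline ID below is a placeholder:

```python
# Hypothetical pipeline ID; copy the real one from the pipeline's settings page.
PIPELINE_ID = "00000000-0000-0000-0000-000000000000"

# flow_progress events are emitted per table (flow); counting them and summing
# their output-row metric per flow can show which tables generate the most work.
flow_metrics = spark.sql(f"""
    SELECT
        origin.flow_name AS flow_name,
        COUNT(*) AS progress_events,
        SUM(CAST(details:flow_progress.metrics.num_output_rows AS BIGINT)) AS total_output_rows
    FROM event_log("{PIPELINE_ID}")
    WHERE event_type = 'flow_progress'
    GROUP BY origin.flow_name
    ORDER BY total_output_rows DESC
""")

flow_metrics.show(50, truncate=False)
```

Pairing that with the cluster metrics UI should indicate whether the hot spot is a few expensive flows or simply too many flows for the driver to coordinate on that node size.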