
DLT Pipeline failing (due to > 500 tables): is there a limit on tables in the pipeline graph?

venkatgmf
New Contributor II

DLT pipeline failing due to INTERNAL_ERROR: Communication lost with driver. Cluster 0719-162209-rx37csry was not reachable for 120 seconds.

[Attachment: DLT communication error.png]

1 ACCEPTED SOLUTION

Accepted Solutions

szymon_dybczak
Contributor

Hi @venkatgmf ,

Yeah, you are right that a high number of tables could be a problem.

If you're experiencing issues with the driver node becoming unresponsive due to garbage collection (GC), it might be a sign that the resources allocated to the driver are insufficient.

To manage the ingestion of a large number of tables, you can consider batching the tables. You can create multiple DLT pipelines, each handling a subset of the tables. This way, you distribute the load across multiple pipelines, reducing the pressure on a single pipeline and potentially mitigating the GC issue.

In terms of compute type on Azure, you might want to consider using larger VM sizes for your Databricks clusters, especially for the driver node, to handle the load of reading a large number of tables. The choice of VM size depends on the size and complexity of your tables.

Also, consider tuning the Spark configurations related to memory management and GC. For instance, you can adjust the Spark driver memory, the fraction of memory dedicated to Spark's storage and execution, and the GC settings.
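For illustration, here is a minimal sketch of the batching idea in a DLT Python notebook. It assumes each pipeline is given a "table_group" value in its pipeline configuration and only defines tables for its own slice; the table names, landing paths, source format, and group size are hypothetical placeholders, not anything from your setup:

import dlt

# "spark" is provided by the Databricks runtime inside a DLT notebook.
# Each pipeline instance handles only one group of tables, selected via the
# pipeline configuration key "table_group" (hypothetical key name).
ALL_TABLES = [f"source_table_{i:03d}" for i in range(500)]  # placeholder table names
GROUP_SIZE = 100
group = int(spark.conf.get("table_group", "0"))
tables_for_this_pipeline = ALL_TABLES[group * GROUP_SIZE:(group + 1) * GROUP_SIZE]

def define_table(table_name: str):
    # Define one streaming live table per source table in this pipeline's group.
    @dlt.table(name=table_name, comment=f"Ingested copy of {table_name}")
    def _ingest():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")  # assumed source format
            .load(f"/mnt/raw/{table_name}")          # hypothetical landing path
        )

for t in tables_for_this_pipeline:
    define_table(t)

You would then create several pipelines from this same notebook, each differing only in the table_group value set in its configuration, so no single driver has to plan the full graph of 500+ tables.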

Could you also attach the cluster logs? Also, take a look at the article below to find the most probable cause of this issue:

https://kb.databricks.com/en_US/jobs/driver-unavailable

