03-04-2025 05:53 AM
Hello community,
I was working on optimising the driver memory, since some of our code is not well optimised for Spark, and as a temporary measure I was planning to restart the cluster to free up memory.
That could be a workable solution, because the cluster is idle during the first few minutes of each hour, which makes it a good moment to restart it and free up memory. However, looking at the standard output, it seems that no memory is actually freed. Why does this happen? Do I need to terminate and start the cluster instead of just restarting it?
03-04-2025 06:36 AM
Hi @jeremy98,
Generally, Databricks recommends regularly restarting clusters, particularly interactive ones, for routine clean-up. Restarting, or terminating and starting the cluster anew, stops all processes and frees up memory effectively, so a restart should indeed clean it up. You can verify this in your cluster metrics once the cluster has restarted.
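If you want to automate the hourly restart you mentioned, the Clusters API exposes a restart endpoint. A minimal sketch, assuming the workspace URL, personal access token and cluster ID are placeholders you would fill in:

import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder
CLUSTER_ID = "<cluster-id>"                                         # placeholder

# Trigger a restart of a running cluster via the Clusters API.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/restart",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID},
)
resp.raise_for_status()

You could call this from a scheduled job in the window where the cluster is idle.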
03-04-2025 06:47 AM - edited 03-04-2025 06:48 AM
Thanks Alberto, for the clarification! Yes, that is true; in effect the metrics UI now shows duplicated entries for the driver and the worker(s) after the restart. I think that is normal behaviour.
03-04-2025 06:53 AM
No problem, happy to assist!
03-04-2025 07:14 AM
Hi, I have another question: Usually, the driver should free memory by itself, but is it possible that the driver fails to do so? Why does this happen, and what issues can arise from this behavior?
03-04-2025 07:22 AM
Yes, it is indeed possible that the driver, for some reason, does not free up memory. If that happens, you will see these kinds of failures:
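For illustration, a pattern like the following (hypothetical table list and variable names) keeps references to collected results on the driver, so its memory keeps growing and garbage collection cannot reclaim it:

# Hypothetical illustration: holding collected results on the driver.
all_results = []
for table_name in table_names:          # placeholder list of tables
    df = spark.table(table_name)
    all_results.append(df.collect())    # entire table materialised on the driver and kept in memory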
03-04-2025 07:26 AM
Exactly, thanks, Alberto! But in general, is it best practice to restart a cluster every week to prevent this issue? Or does this problem happen because the code is not well-written?
03-04-2025 07:32 AM
Correct, it is best practice to restart the cluster regularly! Regular restarts help mitigate memory leaks and accumulated GC pressure.
As for whether it happens because of your code, that depends on what you are doing and whether you follow best practices; I would need more details to tell.
03-05-2025 09:48 AM
Hi,
The code synchronizes Databricks with PostgreSQL by identifying differences and applying INSERT, UPDATE, or DELETE operations to update PostgreSQL. The steps are as follows:
Could the issue be that psycopg2 is not a Spark API, so all of its execution is handled by the driver? Or is the .collect() operation causing a bottleneck by bringing too much data to the driver at once?
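For illustration, one way to avoid pulling everything to the driver with .collect() is to write from the executors using foreachPartition, opening a psycopg2 connection per partition. This is only a minimal sketch; the connection details, table name, columns and upsert statement are hypothetical placeholders:

import psycopg2
from psycopg2.extras import execute_values

def upsert_partition(rows):
    # One connection per partition, opened on the executor, not the driver.
    conn = psycopg2.connect(
        host="my-postgres-host",   # placeholder
        dbname="my_db",            # placeholder
        user="my_user",            # placeholder
        password="my_password",    # placeholder
    )
    try:
        with conn.cursor() as cur:
            execute_values(
                cur,
                """
                INSERT INTO target_table (id, value)   -- placeholder table/columns
                VALUES %s
                ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value
                """,
                [(r["id"], r["value"]) for r in rows],
            )
        conn.commit()
    finally:
        conn.close()

# diff_df is the DataFrame of differences computed in Spark (placeholder name).
diff_df.foreachPartition(upsert_partition)

With this pattern the driver only coordinates the job, and each executor writes its own partition of the diff to PostgreSQL.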
03-07-2025 02:06 AM
Any suggestions, @Alberto_Umana?