Ashwin_DSA
Databricks Employee

Hi @jayshan,

Sorry for the delayed response, and thanks for the extra details and for sharing your workaround.

This behaviour comes from how Spark Connect ML works in serverless mode, not from a traditional JVM/GC leak. On serverless, fitted models are cached on the driver, and there are strict limits on both the maximum size of a single model and the total size of all models cached in a Spark session.

When you run training multiple times in the same underlying session, previously fitted models can remain in that server‑side cache. After enough runs, the cache hits its configured limits, and you get errors like the one you reported.

Changing the serverless memory (16 GB ↔ 32 GB) forces a brand‑new backend instance, which effectively gives you a fresh Spark session and an empty ML cache. That’s why it "fixes" things temporarily. It’s not really fixing a leak. It’s just a hard reset.

Having checked internally, here are the supported options as of today:

  • Keep individual models below the documented size limit (for example, by reducing feature dimensionality or model complexity).
  • Avoid accumulating many fitted models in a long‑lived session. Explicitly deleting model variables in your client code once you no longer need them helps the server‑side cache clean them up.
  • For larger or repeated training runs, consider using a Standard/shared or dedicated cluster, which has much higher and more configurable limits than serverless.
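
To illustrate the second option, here is a minimal Python sketch of a "fit, summarise, release" pattern. The helper name fit_and_release and the summarize callback are hypothetical, not part of any Databricks or Spark API; the helper only assumes the estimator exposes a fit() method, as pyspark.ml estimators do. How quickly the server‑side cache frees the model after the client reference is dropped is an assumption here, so treat this as a pattern to try rather than a guaranteed fix.

```python
import gc
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def fit_and_release(estimator: Any, train_df: Any, summarize: Callable[[Any], T]) -> T:
    """Fit a model, keep only a small summary, then drop the model reference.

    `summarize` should return plain Python data (floats, lists, dicts) and
    must not return the model itself, otherwise the model stays referenced.
    """
    model = estimator.fit(train_df)
    try:
        return summarize(model)
    finally:
        # Explicitly drop the client-side reference to the fitted model.
        del model
        # Collect any reference cycles that might keep the object alive.
        gc.collect()


# Hypothetical usage with PySpark (assumes an existing DataFrame `train_df`
# with "features" and "label" columns; not executed here):
#
# from pyspark.ml.classification import LogisticRegression
#
# results = {}
# for reg in (0.0, 0.01, 0.1):
#     lr = LogisticRegression(regParam=reg)
#     results[reg] = fit_and_release(
#         lr, train_df, lambda m: m.coefficients.toArray().tolist()
#     )
```

The point of the design is that only small, plain values (metrics, coefficients) leave the function, so no fitted model stays referenced in a long‑lived notebook or job session between runs.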

Our engineering team is actively evolving Spark Connect ML support on serverless (including error handling and documentation), but there isn’t currently a switch that removes these size limits. If you are on a paid workspace and this is blocking you, I’d recommend opening a support case with your workspace ID and an example job run. Support can check whether configuration changes are appropriate for your environment.

Hope this helps.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***
