Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Generic Spark Connect ML error. The fitted or loaded model size is too big.

jayshan
New Contributor III

When I train models in the serverless environment V4 (Premium Plan), the system occasionally returns the error message listed below, especially after running the model training code multiple times. We have tried creating new serverless sessions, which sometimes helps but not consistently. We also followed the suggestion in the error message and deleted previously trained models, but that did not resolve the issue. This behavior also occurs in the Free Edition.

"[CONNECT_ML.MODEL_SIZE_OVERFLOW_EXCEPTION] Generic Spark Connect ML error. The fitted or loaded model size is about 572745576 bytes. Please fit or load a model smaller than 268435456 bytes. SQLSTATE: XX000"

Based on some preliminary research, I suspect this may be related to incomplete memory cleanup or garbage collection. We would greatly appreciate it if you could check whether there is a known solution or connect us with the appropriate person to help investigate further.
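For scale, here is a quick arithmetic check on the two byte counts quoted in the error message (the numbers come from the error itself; the conversion is just for readability):

```python
# Quick arithmetic on the byte counts quoted in the error message:
# the cap (268435456 bytes) is exactly 256 MiB, and the fitted model
# (572745576 bytes) comes out to roughly 546 MiB, more than twice the cap.
MODEL_BYTES = 572_745_576
LIMIT_BYTES = 268_435_456

def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 ** 2)

print(f"model: {to_mib(MODEL_BYTES):.1f} MiB")       # ~546.2 MiB
print(f"limit: {to_mib(LIMIT_BYTES):.1f} MiB")       # 256.0 MiB
print(f"over by: {MODEL_BYTES / LIMIT_BYTES:.2f}x")  # ~2.13x
```

So simply trimming the model slightly would not be enough; it would need to shrink to less than half its current size to fit under the cap.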

1 ACCEPTED SOLUTION


Ashwin_DSA
Databricks Employee

Hi @jayshan,

I'm sorry for the delayed response to your question, and thanks for the extra details and for sharing your workaround.

This behaviour is tied to how Spark Connect ML works in serverless mode, rather than a traditional JVM/GC leak. On serverless, fitted models are cached on the driver, and there are strict limits on the maximum size of a single model and the total size of all models cached in a Spark session.

When you run training multiple times in the same underlying session, previously fitted models can remain in that server‑side cache. After enough runs, the cache hits its configured limit and raises the MODEL_SIZE_OVERFLOW_EXCEPTION you reported.

Changing the serverless memory (16 GB ↔ 32 GB) forces a brand‑new backend instance, which effectively gives you a fresh Spark session and an empty ML cache. That’s why it "fixes" things temporarily. It’s not really fixing a leak. It’s just a hard reset.

Having checked internally, here are the supported options as of today:

  • Keep individual models below the documented size limit (for example, by reducing feature dimensionality or model complexity).
  • Avoid accumulating many fitted models in a long‑lived session. Explicitly deleting model variables in your client code helps the cache clean up.
  • For larger or repeated training runs, consider using a Standard/shared or dedicated cluster, which has much higher and more configurable limits than serverless.
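To illustrate the second point, here is a minimal pure‑Python sketch of the pattern. The `FittedModel` class is just a stand‑in for a Spark Connect ML model handle, not a real API: the idea is to extract the metrics you need from a fitted model and drop the reference before the next run, so the handle does not accumulate in the session.

```python
import gc

class FittedModel:
    """Stand-in for a fitted model handle (placeholder, not a Spark API)."""
    live = 0  # how many handles are currently alive

    def __init__(self):
        FittedModel.live += 1

    def __del__(self):
        FittedModel.live -= 1

def train_and_score() -> float:
    model = FittedModel()  # fit() returns a handle the session holds on to
    score = 0.9            # extract the metrics you need from the model...
    del model              # ...then drop the handle promptly
    gc.collect()           # nudge cleanup in long-lived interactive sessions
    return score

# Repeated runs no longer accumulate stale handles between iterations.
for _ in range(5):
    train_and_score()

print(FittedModel.live)  # 0
```

With real Spark ML code the same shape applies: `del` the fitted model variable (or reassign it) once you have logged or saved what you need, rather than keeping every run's model bound to a notebook variable.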

Our engineering team is actively evolving Spark Connect ML support on serverless (including error handling and documentation), but there isn’t currently a switch that removes these size limits. If you are on a paid workspace and this is blocking you, I’d recommend opening a support case with your workspace ID and an example job run. Support can check whether configuration changes are appropriate for your environment.

Hope this helps.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***


4 REPLIES

jayshan
New Contributor III

This issue seems to be a bug in the serverless environment v4. The serverless instance does not clean up unused models promptly, which leads to insufficient space for new model training. Neither the "New Session" nor the "Terminate" functions fully reset the serverless instance. We have to reconfigure the serverless environment by changing the memory, either from 16GB to 32GB or from 32GB to 16GB. This forces the serverless instance to be fully reset, and rerunning the model training then no longer produces the error. But this is only a temporary fix: after several more training runs, the same error occurs again. Hope the Serverless team can fix this bug ASAP.

szymon_dybczak
Esteemed Contributor III

Thanks for sharing the workaround, @jayshan. I hope that this will be fixed soon 🙂

Ashwin_DSA
Databricks Employee

Hi @jayshan,


I'm also adding a documentation link for you to refer to. 


Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***