a month ago
When I train models in the serverless environment V4 (Premium Plan), the system occasionally returns the error message listed below, especially after running the model training code multiple times. We have tried creating new serverless sessions, which sometimes helps but not consistently. We also followed the suggestion in the error message and deleted previously trained models, but that did not resolve the issue. This behavior also occurs in the Free Edition.
"[CONNECT_ML. MODEL_SIZE_OVERFLOW_EXCEPTION] Generic Spark Connect ML error. The fitted or loaded model size is about 572745576 bytes. Please fit or load a model smaller than 268435456 bytes. SQLSTATE: XX000"
Based on some preliminary research, I suspect this may be related to incomplete memory cleanup or garbage collection. We would greatly appreciate it if you could check whether there is a known solution or connect us with the appropriate person to help investigate further.
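For reference, the two byte counts in the error message above translate to roughly the following sizes. This is just a quick sanity check in plain Python (the constants are copied from the error text in this post, not from any API):

```python
# Byte counts taken from the error message quoted above (illustrative only).
MODEL_SIZE = 572_745_576   # size of the fitted model reported by the error
CACHE_LIMIT = 268_435_456  # per-model limit reported by the error (= 256 MiB)

def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 ** 2)

print(f"model:   {to_mib(MODEL_SIZE):.1f} MiB")                # ~546.2 MiB
print(f"limit:   {to_mib(CACHE_LIMIT):.1f} MiB")               # 256.0 MiB
print(f"over by: {to_mib(MODEL_SIZE - CACHE_LIMIT):.1f} MiB")  # ~290.2 MiB
```

So the fitted model is more than twice the per-model cap, which is why deleting old models alone did not help in this case.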
4 weeks ago
This issue appears to be a bug in the serverless environment v4. The serverless instance does not clean up unused models in a timely manner, which leads to insufficient space for new model training. Neither the "New Session" nor the "Terminate" function fully resets the serverless instance. We have to reconfigure the serverless environment by changing the memory, either from 16 GB to 32 GB or from 32 GB to 16 GB. This forces a full reset of the serverless instance, after which rerunning the model training no longer produces the error. But this is only a temporary fix: after several more training runs, the same error occurs again. Hope the Serverless team can fix this bug ASAP.
4 weeks ago
Thanks for sharing the workaround, @jayshan. I hope this will be fixed soon 🙂
a week ago
Hi @jayshan,
I'm sorry for the delayed response to your question, and thanks for the extra details and for sharing your workaround.
This behaviour is tied to how Spark Connect ML works in serverless mode, rather than a traditional JVM/GC leak. On serverless, fitted models are cached on the driver, and there are strict limits on the maximum size of a single model and the total size of all models cached in a Spark session.
When you run training multiple times in the same underlying session, previously fitted models can remain in that server‑side cache. After enough runs, the cache hits its configured limits, and you see errors like the one you are seeing.
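To make the failure mode above concrete, here is a toy sketch of a per-session cache with a per-model size cap. This is purely an illustration of the behaviour described, not Spark's actual implementation; the class, its limit constant, and the error text are all hypothetical stand-ins:

```python
class BoundedModelCache:
    """Toy model of a per-session ML cache with a per-model size cap.

    Illustration only; NOT Spark's real implementation.
    """
    PER_MODEL_LIMIT = 268_435_456  # bytes, the cap quoted in the error message

    def __init__(self) -> None:
        self._models: dict[str, int] = {}

    def put(self, name: str, size_bytes: int) -> None:
        # Reject any single model over the cap, mimicking the overflow error.
        if size_bytes > self.PER_MODEL_LIMIT:
            raise MemoryError(
                f"model is {size_bytes} bytes; "
                f"limit is {self.PER_MODEL_LIMIT} bytes"
            )
        self._models[name] = size_bytes

    def total_bytes(self) -> int:
        # Cached models accumulate across runs in the same session.
        return sum(self._models.values())

cache = BoundedModelCache()
cache.put("run_1", 100_000_000)      # fits
cache.put("run_2", 100_000_000)      # fits; cache keeps growing across runs
try:
    cache.put("run_3", 572_745_576)  # the size from the error: rejected
except MemoryError as err:
    print(err)
```

The point of the sketch: even when each model individually fits, repeated runs accumulate in the session-level cache, and a sufficiently large model is rejected outright regardless of how much room remains.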
Changing the serverless memory (16 GB ↔ 32 GB) forces a brand‑new backend instance, which effectively gives you a fresh Spark session and an empty ML cache. That’s why it "fixes" things temporarily. It’s not really fixing a leak. It’s just a hard reset.
Having checked internally, here are the supported options as of today:
Our engineering team is actively evolving Spark Connect ML support on serverless (including error handling and documentation), but there isn’t currently a switch that removes these size limits. If you are on a paid workspace and this is blocking you, I’d recommend opening a support case with your workspace ID and an example job run. Support can check whether configuration changes are appropriate for your environment.
Hope this helps.
If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.
a week ago
I'm also adding a documentation link for you to refer to.