07-06-2025 10:20 AM - edited 07-06-2025 10:22 AM
Hi,
I've created a small UC Python UDF to test whether it works with custom dependencies (the new Public Preview feature), and every time I run it I get an OOM error.
Context: the function loads a SpaCy language model, processes a string and returns the number of "PERSON" entities found in that text. With a blank (lighter) model, it works fine, but with the basic "en_core_web_sm" model, it OOMs.
Hypothesis: it looks like the small language model the function loads is too large for the underlying process to handle, so it crashes. A solution could be to increase the available memory somehow.
Question: is there a way to configure (increase) the memory available to the underlying process so that the UDF doesn't crash with OOM? Or is there any other way to solve this?
Minimal Working Example (MWE):
1. Create small mock dataset
from pyspark.sql import Row
data = [
    Row(id=1, document="John Smith was born in London in 1999"),
    Row(id=2, document="Alice Blake went to Colorado last winter"),
    Row(id=3, document="Michael Johnson visited Paris in 2018"),
    Row(id=4, document="Emma Davis moved to New York in 2005"),
    Row(id=5, document="David Brown traveled to Tokyo in 2020"),
]
spark.createDataFrame(data).write.saveAsTable('simple_documents')
2. Create the UC Python UDF
spark.sql("""
CREATE OR REPLACE FUNCTION count_entities_of_type(document STRING, of_type STRING) RETURNS FLOAT
LANGUAGE PYTHON
PARAMETER STYLE PANDAS
HANDLER 'handler_function'
ENVIRONMENT (
dependencies = '["spacy", "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl"]',
environment_version = 'None'
)
AS $$
'''
of_type: a valid SpaCy NER label type (e.g., PERSON, ORG, GPE, DATE, etc.)
'''
import pandas as pd
import spacy
from typing import Iterator, Tuple

# Expensive up-front work: load the model once when the UDF environment starts.
nlp = spacy.load('en_core_web_sm')
# nlp = spacy.blank('en')  # the lighter blank pipeline does not hit the OOM

def handler_function(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    def find_and_count_entities(text: str, of_type: str) -> float:
        doc = nlp(text)
        entities = [ent for ent in doc.ents if ent.label_ == of_type]
        return float(len(entities))

    for document_series, of_type_series in batches:
        yield pd.Series([
            find_and_count_entities(text, label)
            for text, label in zip(document_series, of_type_series)
        ])
$$
""")
3. Invoke the function against the mock table (this fails)
spark.sql('''
SELECT
*
, count_entities_of_type(document, 'PERSON') AS n_entities
FROM simple_documents
LIMIT 1
''')
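To sanity-check the hypothesis above, the model's memory footprint can be measured outside the UDF. This is only a rough sketch, assuming spacy and the same en_core_web_sm wheel are installed in the notebook environment (e.g., via %pip install) and that psutil is available there:
import os
import psutil
import spacy

# Compare the process's resident memory before and after loading the model.
proc = psutil.Process(os.getpid())
before_mb = proc.memory_info().rss / 1024 ** 2
nlp = spacy.load('en_core_web_sm')
after_mb = proc.memory_info().rss / 1024 ** 2
print(f"Loading the model added roughly {after_mb - before_mb:.0f} MB of RSS")
This only shows how much the model needs in a regular Python process; whatever limit the UDF's isolated environment enforces is a separate question.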
Any pointers to resources showing how to increase memory, or explaining whether this problem is even solvable in the first place, are greatly appreciated. Thanks!
07-07-2025 01:57 AM
From the code, I can see you are already running it through spark.sql, which should be fine. I am putting together a repro; please wait.
07-07-2025 02:43 AM
Thanks for the response! In fact, I only used spark.sql because the forum's code-snippet tool didn't allow SQL syntax highlighting, only Python. But that doesn't matter; the function creation works fine.
07-07-2025 02:45 AM
Could you invoke an action on that resulting dataframe (e.g., _sqldf.display()) to see what happens when the UDF runs for real?
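For example, something along these lines (a sketch; _sqldf is the implicit DataFrame Databricks binds to the previous SQL cell's result, or you can re-run the query and materialize it explicitly):
# Materialize the result so the UDF actually executes on the rows.
result_df = spark.sql('''
SELECT
*
, count_entities_of_type(document, 'PERSON') AS n_entities
FROM simple_documents
LIMIT 1
''')
result_df.display()  # or result_df.collect(); any action forces the UDF to run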
07-08-2025 03:17 PM
Hello Carlos
Good day!!
I created a repro for this and am getting the same error. I used serverless compute, and even with 32 GB of serverless memory I got the same error.
Sorry, I was busy with the MSFT layoffs, which affected me as well. Resolving this became a passion of mine, so I'm doing it in my free time.
I would highly recommend opening a ticket with Databricks. If you are on AWS, I'm not sure about the procedure, but if you are using Azure, raise a ticket with support so that Azure Databricks can help you.
Alternatively, you can also raise a ticket with Databricks directly, but they will ask for the case number from Azure.
Please raise the ticket using this link: https://help.databricks.com/s/contact-us?ReqType=training. Please explain the issue clearly so that it is easy for the support team to help.
07-08-2025 03:20 PM
I also tried with a classic cluster and spent a couple of hours trying to load the libraries, but I was unable to. Maybe someone else can help you with this.
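For reference, that classic-cluster route would look roughly like this. It is only a sketch, not the UC function: a session-scoped iterator pandas UDF with notebook-scoped installs, and on a classic cluster the executor memory can be tuned through standard Spark settings such as spark.executor.memory and spark.executor.memoryOverhead:
# Run notebook-scoped installs in their own cell first:
# %pip install spacy https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('float')
def count_person_entities(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    import spacy
    nlp = spacy.load('en_core_web_sm')  # loaded once per executor Python worker
    for documents in batches:
        yield documents.apply(
            lambda text: float(sum(ent.label_ == 'PERSON' for ent in nlp(text).ents))
        )

spark.table('simple_documents') \
    .withColumn('n_entities', count_person_entities('document')) \
    .display()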