07-06-2025 10:20 AM - edited 07-06-2025 10:22 AM
Hi,
I've created a small UC Python UDF to test whether it works with custom dependencies (the new Public Preview feature), and every time I run it I get an OOM error.
Context: the function loads a SpaCy language model, processes a string and returns the number of "PERSON" entities found in that text. With a blank (lighter) model, it works fine, but with the basic "en_core_web_sm" model, it OOMs.
Hypothesis: it looks like the small language model the function loads is too large for the underlying process to handle, so it crashes. A solution could be to increase the available memory somehow.
Question: is there a way to configure (increase) the memory available to the underlying process so that the UDF doesn't crash with OOM? Or is there any other way to solve this?
Minimal Working Example (MWE):
1. Create small mock dataset
from pyspark.sql import Row
data = [
    Row(id=1, document="John Smith was born in London in 1999"),
    Row(id=2, document="Alice Blake went to Colorado last winter"),
    Row(id=3, document="Michael Johnson visited Paris in 2018"),
    Row(id=4, document="Emma Davis moved to New York in 2005"),
    Row(id=5, document="David Brown traveled to Tokyo in 2020"),
]
spark.createDataFrame(data).write.saveAsTable('simple_documents')
2. Create the UC Python UDF
spark.sql("""
CREATE OR REPLACE FUNCTION count_entities_of_type(document STRING, of_type STRING) RETURNS FLOAT
LANGUAGE PYTHON
PARAMETER STYLE PANDAS
HANDLER 'handler_function'
ENVIRONMENT (
dependencies = '["spacy", "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl"]',
environment_version = 'None'
)
AS $$
'''
of_type: a valid SpaCy NER label type (e.g., PERSON, ORG, GPE, DATE, etc.)
'''
import pandas as pd
import spacy
from typing import Iterator, Tuple

# Expensive up-front work: load the model once when the UDF environment starts.
nlp = spacy.load('en_core_web_sm')
# nlp = spacy.blank('en')  # the lighter blank pipeline does not hit the OOM

def handler_function(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    def find_and_count_entities(text: str, of_type: str) -> float:
        doc = nlp(text)
        entities = [ent for ent in doc.ents if ent.label_ == of_type]
        return float(len(entities))

    for document_series, of_type_series in batches:
        yield pd.Series([
            find_and_count_entities(text, label)
            for text, label in zip(document_series, of_type_series)
        ])
$$
""")
3. Invoke the function against the mock table (this fails)
spark.sql('''
SELECT
*
, count_entities_of_type(document, 'PERSON') AS n_entities
FROM simple_documents
LIMIT 1
''')
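To sanity-check the hypothesis above, the model's memory footprint can be measured outside the UDF. This is only a rough sketch, assuming spacy and the same en_core_web_sm wheel are installed in the notebook environment (e.g., via %pip install) and that psutil is available there:
import os
import psutil
import spacy

# Compare the process's resident memory before and after loading the model.
proc = psutil.Process(os.getpid())
before_mb = proc.memory_info().rss / 1024 ** 2
nlp = spacy.load('en_core_web_sm')
after_mb = proc.memory_info().rss / 1024 ** 2
print(f"Loading the model added roughly {after_mb - before_mb:.0f} MB of RSS")
This only shows how much the model needs in a regular Python process; whatever limit the UDF's isolated environment enforces is a separate question.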
Any pointers to resources showing how to increase memory, or explaining whether this problem is even solvable in the first place, are greatly appreciated. Thanks!
07-07-2025 01:57 AM
From the code, I can see you are already running it through spark.sql, which should be fine. I am putting together a repro; please wait.
07-07-2025 02:43 AM
Thanks for the response! In fact, I only used spark.sql because the forum's code-snippet tool didn't allow SQL syntax highlighting, only Python. But that doesn't matter; the function creation works fine.
07-07-2025 02:45 AM
Could you invoke an action on that resulting dataframe (e.g., _sqldf.display()) to see what happens when the UDF runs for real?
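For example, something along these lines (a sketch; _sqldf is the implicit DataFrame Databricks binds to the previous SQL cell's result, or you can re-run the query and materialize it explicitly):
# Materialize the result so the UDF actually executes on the rows.
result_df = spark.sql('''
SELECT
*
, count_entities_of_type(document, 'PERSON') AS n_entities
FROM simple_documents
LIMIT 1
''')
result_df.display()  # or result_df.collect(); any action forces the UDF to run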
07-08-2025 03:17 PM
Hello Carlos
Good day!!
I created a repro for this and am getting the same error. I used serverless compute, and even with 32 GB of serverless memory I got the same error.
Sorry, I was busy with the MSFT layoffs, which affected me as well. Resolving this became a passion of mine, so I'm doing it in my free time.
I would highly recommend opening a ticket with Databricks. If you are on AWS, I'm not sure about the procedure, but if you are using Azure, raise a ticket with support so that Azure Databricks can help you.
Alternatively, you can also raise a ticket with Databricks directly, but they will ask for the case number from Azure.
Please raise the ticket using this link: https://help.databricks.com/s/contact-us?ReqType=training. Please explain the issue clearly so that it is easy for the support team to help.
07-08-2025 03:20 PM
I also tried with a classic cluster and spent a couple of hours trying to load the libraries, but I was unable to. Maybe someone else can help you with this.
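For reference, that classic-cluster route would look roughly like this. It is only a sketch, not the UC function: a session-scoped iterator pandas UDF with notebook-scoped installs, and on a classic cluster the executor memory can be tuned through standard Spark settings such as spark.executor.memory and spark.executor.memoryOverhead:
# Run notebook-scoped installs in their own cell first:
# %pip install spacy https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('float')
def count_person_entities(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    import spacy
    nlp = spacy.load('en_core_web_sm')  # loaded once per executor Python worker
    for documents in batches:
        yield documents.apply(
            lambda text: float(sum(ent.label_ == 'PERSON' for ent in nlp(text).ents))
        )

spark.table('simple_documents') \
    .withColumn('n_entities', count_person_entities('document')) \
    .display()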