
Out of memory error when installing environment dependencies of UC Python UDF

carlosjuribe
New Contributor III

Hi,

I've created a small UC Python UDF to test whether it works with custom dependencies (a new Public Preview feature), and every time I run it I get an OOM error with this message:

 
[UDF_ENVIRONMENT_USER_ERROR.OUT_OF_MEMORY] Failed to install UDF dependencies for <catalog>.<schema>.<function>. Installation crashed due to running out of memory. SQLSTATE: 39000

Context: the function loads a SpaCy language model, processes a string and returns the number of "PERSON" entities found in that text. With a blank (lighter) model, it works fine, but with the basic "en_core_web_sm" model, it OOMs.

Hypothesis: it looks to me like the small language model the function loads is too big for the environment to handle, so it crashes. A solution could be to increase the available memory somehow.
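
For what it's worth, a rough way to sanity-check this hypothesis locally (in the notebook, not on the UDF executor) is to measure how much resident memory loading the model adds. This is just a sketch and assumes spacy, the en_core_web_sm wheel, and psutil are installed in the local notebook environment:

import os
import psutil   # assumption: psutil is available locally; it is not part of the UDF
import spacy

proc = psutil.Process(os.getpid())
before_mb = proc.memory_info().rss / 1024 ** 2

nlp = spacy.load('en_core_web_sm')   # the same model the UDF loads up front

after_mb = proc.memory_info().rss / 1024 ** 2
print(f"Loading en_core_web_sm added roughly {after_mb - before_mb:.0f} MiB of resident memory")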

Question: Is there a way to configure the memory that the underlying process uses (to increase it) so that the UDF doesn't crash due to OOM? Or, is there any way to solve this?

Minimal Working Example (MWE):

1. Create small mock dataset

from pyspark.sql import Row

data = [
    Row(id=1, document="John Smith was born in London in 1999"),
    Row(id=2, document="Alice Blake went to Colorado last winter"),
    Row(id=3, document="Michael Johnson visited Paris in 2018"),
    Row(id=4, document="Emma Davis moved to New York in 2005"),
    Row(id=5, document="David Brown traveled to Tokyo in 2020")
]

spark.createDataFrame(data).write.saveAsTable('simple_documents')

2. Create the UC Python UDF

spark.sql("""
CREATE OR REPLACE FUNCTION count_entities_of_type(document STRING, of_type STRING) RETURNS FLOAT
LANGUAGE PYTHON
PARAMETER STYLE PANDAS
HANDLER 'handler_function'
ENVIRONMENT (
  dependencies = '["spacy", "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl"]',
  environment_version = 'None'
)
AS $$
'''
of_type: a valid SpaCy NER label type (e.g., PERSON, ORG, GPE, DATE, etc.)
'''
import pandas as pd
import spacy                        # needed for spacy.load below
from typing import Iterator, Tuple

# expensive up-front computation:
nlp = spacy.load('en_core_web_sm')
# nlp = spacy.blank('en')

def handler_function(batches: Iterator[Tuple[pd.Series, pd.Series]]):
    def find_and_count_entities(text: str, of_type: str) -> float:
        doc = nlp(text)
        entities = [ent for ent in doc.ents
                        if ent.label_ == of_type]
        return float(len(entities))

    # each batch is a tuple of pandas Series, one per UDF parameter
    for document_series, of_type_series in batches:
        yield pd.Series([find_and_count_entities(text, label)
                         for text, label in zip(document_series, of_type_series)])
$$
""")

3. Invoke the function against the mock table (this fails)

spark.sql('''
SELECT 
  *
, count_entities_of_type(document, 'PERSON') AS n_entities
FROM simple_documents
LIMIT 1
''')

Any pointers to resources showing how to increase memory, or explaining whether this problem is even solvable in the first place, are greatly appreciated. Thanks!

6 REPLIES

Khaja_Zaffer
Contributor
Hello Carlos,

What I see from the code is that you are already running it via spark.sql, which should be fine. I am creating a repo to reproduce this; please wait.

carlosjuribe
New Contributor III

Thanks for the response! In fact, I only used spark.sql because the code-snippet tool in the message editor didn't allow SQL syntax highlighting, only Python. But that doesn't matter; the function creation works fine.

Khaja_Zaffer
Contributor
Translator
Hello Carlos,
Can you try with serverless once? I got the result below.

[screenshot of the result attached]

carlosjuribe
New Contributor III

Could you invoke an action on that resulting DataFrame (e.g., _sqldf.display()) to see what happens when the UDF runs for real?
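
Roughly something like this, reusing the SELECT from step 3 above (display()/show() just forces the otherwise lazy query, and therefore the UDF and its environment install, to actually run):

result_df = spark.sql('''
SELECT
  *
, count_entities_of_type(document, 'PERSON') AS n_entities
FROM simple_documents
LIMIT 1
''')

result_df.display()   # or result_df.show(); without an action the query never executes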

Khaja_Zaffer
Contributor

Hello Carlos

Good day!!

[screenshot attached]

 



I created the repo for this and I'm getting the same error. I used serverless, and even with a 32 GB serverless environment I got the same error.

Sorry, I was busy with the MSFT layoffs, which affected me as well. Resolving issues has become a passion for me, so I'm doing this in my free time.

I would highly recommend cutting a ticket with Databricks. If you are on AWS, I'm not sure about the procedure, but if you are using Azure, raise a ticket with Azure support so that Azure Databricks can help you.

Alternatively, you can also raise a ticket with Databricks directly, but they will ask for the case number from Azure.

Please raise the ticket using this link: https://help.databricks.com/s/contact-us?ReqType=training. Please explain the issue clearly so that it will be easy for the support team to help.

Khaja_Zaffer
Contributor

I tried with a cluster and spent a couple of hours trying to load the libraries, but was unable to. Maybe someone else can help you with this.
