databricks clusters failed

seefoods
Valued Contributor

Hello guys,

When I run a process to parse PDFs with docling on a serverless cluster using a Python wheel, I get the error below. Does anyone know what happened?

Cordially 

INTERNAL: [ENVIRONMENT_SETUP_ERROR.PYTHON_NOTEBOOK_ENVIRONMENT] An internal error occurred while setting up the UDF environment: Failed to set up the Python notebook environment (hash: -xxxx) for Spark session 7f99ed6e-a888-000d-dff-73d2d8f04974. If the issue persists, please contact Databricks support. SQLSTATE: XX000

 

Sidhant07
Databricks Employee

Hi @seefoods ,

Can you please share the full error stack trace, including any "Caused by" messages?

Also, may I know if using Classic Compute and Install libraries during cluster creation works without any issues?

 

SteveOstrowski
Databricks Employee

Hi @seefoods,

Interesting scenario. docling is a powerful PDF parsing library and it is great that you are exploring it on Databricks. The ENVIRONMENT_SETUP_ERROR.PYTHON_NOTEBOOK_ENVIRONMENT error you are seeing is related to how serverless compute handles Python environment setup, and there are a few things to check and try.

UNDERSTANDING THE ERROR

This error occurs when the serverless compute environment fails to build or activate the Python environment required for your Spark session. With a library like docling, the most common causes are:

1. Native/system-level dependencies that cannot be installed on serverless compute
2. Large dependency footprint exceeding environment limits (docling pulls in PyTorch and several other heavy packages)
3. Dependency conflicts with pre-installed packages on the serverless runtime

Docling requires PyTorch and optionally depends on system-level packages like Tesseract (for OCR). Serverless compute does not support init scripts or system-level package installation, so any dependency that requires native OS libraries beyond what is pre-installed can trigger this error.
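One quick sanity check is to verify from the notebook whether the OS-level binaries your workflow relies on are actually present on the compute you are running. A minimal stdlib-only sketch (the binary names here are examples; tesseract is only needed if you use docling's OCR extras):

```python
import shutil

def missing_system_binaries(binaries):
    """Return the subset of required OS binaries not found on PATH."""
    return [b for b in binaries if shutil.which(b) is None]

# Example: check for Tesseract before enabling OCR features.
missing = missing_system_binaries(["tesseract"])
if missing:
    print(f"Missing system binaries: {missing} - OCR features will fail here")
```

Running this on both serverless and classic compute quickly shows which native dependencies the serverless image is missing.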

RECOMMENDED STEPS TO RESOLVE

1. Try Classic Compute First (as Sidhant07 suggested)

Switch to a classic all-purpose or job cluster where you have full control over the environment. On classic compute you can:
- Use an init script to install system dependencies (e.g., tesseract, leptonica)
- Install your wheel as a cluster library or notebook-scoped library
- Set environment variables like TESSDATA_PREFIX if needed for OCR

This will help you confirm whether the issue is specifically a serverless limitation or a problem with the wheel itself.

2. If You Need Serverless, Minimize Dependencies

If serverless is a requirement, try trimming the docling dependency tree:
- Install only the core docling package without optional OCR extras
- Avoid extras that pull in system-level dependencies (e.g., tesseract)
- Make sure your wheel does not bundle or depend on PySpark (installing PySpark on serverless will break your session)
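To confirm your wheel does not bundle PySpark (or any other package that would conflict with the runtime), you can inspect its contents with the standard library. A small sketch; the wheel path is a placeholder:

```python
import zipfile

def wheel_bundles_package(wheel_path, package):
    """Check whether a wheel contains files under a given top-level package."""
    with zipfile.ZipFile(wheel_path) as wf:
        return any(name.split("/")[0] == package for name in wf.namelist())

# Example usage (hypothetical path):
# wheel_bundles_package("/Volumes/catalog/schema/vol/your_package.whl", "pyspark")
```

If this returns True for "pyspark", rebuild the wheel with PySpark listed as a build-time or dev dependency only, not bundled into the artifact.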

In the notebook Environment side panel (or in your job environment YAML), add the dependency:

docling

Note that --no-deps is a pip install flag, not a valid line in a requirements-style dependency list. If you need to skip transitive dependencies, run %pip install docling --no-deps in the notebook instead, or pin only the exact sub-packages you need to reduce the footprint.

3. Check Wheel Compatibility

Make sure your Python wheel was built for the correct platform and Python version:
- Serverless compute runs Linux x86_64
- Check the Python version of your serverless runtime (typically Python 3.10+)
- If your wheel contains compiled C extensions, they must be built for manylinux

You can verify by checking the wheel filename, for example:
your_package-1.0-cp310-cp310-manylinux_2_17_x86_64.whl
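You can also check the tags programmatically. A small sketch that splits a wheel filename into its Python, ABI, and platform tags per the standard wheel naming layout (it assumes the package name itself contains no hyphens):

```python
def wheel_tags(filename):
    """Parse python/abi/platform tags from a wheel filename.

    Wheel names follow: name-version[-build]-pythontag-abitag-platformtag.whl
    """
    stem = filename[:-len(".whl")]
    parts = stem.split("-")
    python_tag, abi_tag, platform_tag = parts[-3], parts[-2], parts[-1]
    return python_tag, abi_tag, platform_tag

py, abi, plat = wheel_tags("your_package-1.0-cp310-cp310-manylinux_2_17_x86_64.whl")
print(py, plat)  # cp310 manylinux_2_17_x86_64
```

A pure-Python wheel shows up as py3-none-any and is safe on any platform; a cp310/manylinux wheel must match the runtime's Python version and Linux x86_64 platform.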

4. Use Unity Catalog Volumes for Wheel Storage

Upload your wheel to a Unity Catalog volume and reference it from there:

/Volumes/<catalog>/<schema>/<volume>/your_package.whl

This is the recommended approach for custom wheels on serverless.

5. Review the Full Error Stack Trace

As Sidhant07 mentioned, the full stack trace (including any "Caused by" messages) will give more detail on exactly which dependency or step failed. You can find this in the driver logs or the notebook cell output.
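When the trace is long, a quick way to surface just the root-cause lines is to filter for the "Caused by" entries. A stdlib-only sketch with a made-up example trace:

```python
def caused_by_lines(stack_trace):
    """Pull the 'Caused by' root-cause lines out of a Java/Scala-style stack trace."""
    return [line.strip() for line in stack_trace.splitlines()
            if line.strip().startswith("Caused by")]

# Hypothetical example trace for illustration:
trace = """INTERNAL: [ENVIRONMENT_SETUP_ERROR.PYTHON_NOTEBOOK_ENVIRONMENT] ...
    at com.example.Foo.bar(Foo.java:42)
Caused by: java.io.IOException: pip install failed
    at com.example.Baz.qux(Baz.java:7)"""
print(caused_by_lines(trace))
```

Sharing those lines here is usually enough to pinpoint which dependency or setup step failed.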

EXAMPLE: RUNNING DOCLING ON CLASSIC COMPUTE

Here is a pattern that works on a classic cluster:

Step 1 - Create an init script (save to DBFS or a volume):

#!/bin/bash
apt-get update
apt-get install -y tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev

Step 2 - Configure the cluster to use the init script

Step 3 - Install docling via notebook:

%pip install docling

Step 4 - Use docling in your notebook:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("/path/to/your/file.pdf")
print(result.document.export_to_markdown())

DOCUMENTATION REFERENCES

- Serverless compute dependencies: https://docs.databricks.com/en/compute/serverless/dependencies.html
- Serverless compute limitations: https://docs.databricks.com/en/compute/serverless/limitations.html
- Libraries on Databricks: https://docs.databricks.com/en/libraries/index.html
- Docling installation guide: https://docling-project.github.io/docling/getting_started/installation/

SUMMARY

The most likely cause is that docling (or one of its dependencies) requires native system libraries that are not available on serverless compute. Start by testing on a classic cluster to confirm it works there, then if you need serverless, work on trimming the dependency tree to pure-Python packages only. If you share the full stack trace, we can narrow down exactly which dependency is causing the failure.

Hope this helps, and welcome to the community!

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
