Problems with unstructured_data_pipeline

Mariano-Vertiz — Thu, 23 Oct 2025 19:49:05 GMT

Hi everyone,
I'm currently working with the unstructured data pipeline in Databricks, using the official notebook provided by Databricks without any modifications. Strangely, despite being an out-of-the-box resource, the notebook fails during execution with the following error:

PythonException: An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File <command-1127042695011754>, line 240, in _recursive_character_text_splitter
  File <command-1127042695011754>, line 62, in <lambda>
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 817, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 649, in get_tokenizer_config
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/transformers/utils/hub.py", line 462, in cached_file
    except HFValidationError as e:
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1010, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1127, in _hf_hub_download_to_cache_dir
    os.makedirs(os.path.dirname(blob_path), exist_ok=True)
  File "<frozen os>", line 216, in makedirs
  File "<frozen os>", line 216, in makedirs
  File "<frozen os>", line 216, in makedirs
  File "<frozen os>", line 230, in makedirs
OSError: [Errno 30] Read-only file system: '/local_disk0/tmp'
Write not supported
Files in Workspace are read-only from executors. Please consider using Volumes if you need to persist data written from executors.

The error seems to come from the Hugging Face transformers library trying to download or cache a tokenizer model, but it fails because the executor environment doesn't allow writing to /local_disk0/tmp.
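One way to narrow this down is to check which directories the Python process can actually write to. The snippet below is only a rough sketch: `first_writable_dir` is a hypothetical helper name, not part of the Databricks notebook, and the candidate paths you pass in are your own guesses (for example the `/local_disk0/tmp` path from the traceback, or a Unity Catalog Volume path).

```python
import os
import tempfile
from typing import Iterable, Optional


def first_writable_dir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate directory that can be created and written to.

    Hypothetical helper for diagnosis; returns None if no candidate works.
    """
    for path in candidates:
        try:
            # Mirrors what huggingface_hub does in the traceback: create the directory tree.
            os.makedirs(path, exist_ok=True)
            # Write a real (temporary) file, since a directory can exist and still be read-only.
            with tempfile.NamedTemporaryFile(dir=path):
                pass
            return path
        except OSError:
            # Read-only file system, permission denied, parent is a file, etc.
            continue
    return None
```

To see what the executors see, the call would have to run inside the UDF (or a `mapPartitions` function), since the driver and the workers can have different write permissions.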

What’s puzzling is that this notebook is supposed to be plug-and-play. Has anyone else encountered this issue? Are there known workarounds or fixes—perhaps involving Volumes or changing the cache directory?

Any help or insight would be greatly appreciated!

Thanks,
Mariano

Re: Problems with unstructured_data_pipeline

dkushari — Thu, 23 Oct 2025 21:51:19 GMT

Hi @Mariano-Vertiz - can you please share the link to the notebook you are trying to run? Thank You!

Re: Problems with unstructured_data_pipeline

Mariano-Vertiz — Thu, 23 Oct 2025 22:01:07 GMT

Hello @dkushari, my mistake! Here is the link.
As for my personal notebook, that is here.

Thank you for the reply!

Mariano

Re: Problems with unstructured_data_pipeline

dkushari — Thu, 23 Oct 2025 22:24:31 GMT

No worries at all, @Mariano-Vertiz. Are you trying to extract information from a bunch of PDFs and query those, or use them as a chatbot?

If yes, can you look at Agent Bricks - https://docs.databricks.com/aws/en/generative-ai/agent-bricks/

Information Extraction - https://docs.databricks.com/aws/en/generative-ai/agent-bricks/key-info-extraction

Knowledge Assistant - https://docs.databricks.com/aws/en/generative-ai/agent-bricks/knowledge-assistant

Re: Problems with unstructured_data_pipeline

Mariano-Vertiz — Fri, 24 Oct 2025 19:17:27 GMT

Yes, both. I am looking to vectorize a set of PDFs and then feed them into a Knowledge Assistant. I was told performance would be better if the Knowledge Assistant was fed a vector search index rather than the files directly. Ultimately, this Knowledge Assistant would be part of a multi-agent supervisor.

Re: Problems with unstructured_data_pipeline

dkushari — Fri, 24 Oct 2025 20:49:46 GMT

Hi @Mariano-Vertiz - Which access mode are you using for your cluster: dedicated or standard? I think it is failing because a standard cluster does not allow the low-level operation the notebook attempts in cell 42. Is that where it's failing? I ran it end to end on a dedicated cluster, and it worked as expected, so please try a dedicated cluster. Here is my run as a .dbc file uploaded to a public Git repository. Download it and import it into Databricks.

import os
os.environ['TRANSFORMERS_CACHE'] = '/dbfs/tmp/transformers_cache'  # set before transformers is imported
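For the cache-directory route, here is a minimal sketch of how it might be wired up. It rests on several assumptions: `configure_hf_cache` and `load_tokenizer` are hypothetical helper names, not functions from the Databricks notebook, and the cache path is whatever writable location you choose. The error message itself suggests a Volume, so something under `/Volumes/<catalog>/<schema>/<volume>/` may be a better fit than `/dbfs` on a standard cluster. As far as I know, recent transformers releases flag `TRANSFORMERS_CACHE` as deprecated in favor of `HF_HOME`, so the sketch sets both and also passes `cache_dir` explicitly.

```python
import os


def configure_hf_cache(cache_dir: str) -> str:
    """Create cache_dir and point the Hugging Face libraries at it.

    Hypothetical helper. It should run before transformers / huggingface_hub
    are imported in the same Python process, because those libraries read the
    environment variables when they are imported.
    """
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["HF_HOME"] = cache_dir
    # Older variable, kept because it is the one used in the post above.
    os.environ["TRANSFORMERS_CACHE"] = cache_dir
    return cache_dir


def load_tokenizer(model_name: str, cache_dir: str):
    """Load a tokenizer with an explicit, writable cache location."""
    configure_hf_cache(cache_dir)
    # Imported late so the environment variables above are already in place.
    from transformers import AutoTokenizer

    # cache_dir is passed explicitly so the download does not fall back to a default path.
    return AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
```

One thing to keep in mind: an environment variable set in a notebook cell on the driver does not necessarily reach the Python workers on the executors. Calling something like `load_tokenizer` inside the UDF that does the chunking is one way to make sure the setting applies where the traceback is actually raised.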