<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Problems with unstructured_data_pipeline in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135897#M1266</link>
    <description>&lt;P&gt;The official Databricks unstructured data pipeline notebook fails during execution with OSError: [Errno 30] Read-only file system: '/local_disk0/tmp' when the Hugging Face transformers library tries to download or cache a tokenizer.&lt;/P&gt;</description>
    <pubDate>Thu, 23 Oct 2025 19:49:05 GMT</pubDate>
    <dc:creator>Mariano-Vertiz</dc:creator>
    <dc:date>2025-10-23T19:49:05Z</dc:date>
    <item>
      <title>Problems with unstructured_data_pipeline</title>
      <link>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135897#M1266</link>
      <description>&lt;P&gt;Hi everyone,&lt;BR /&gt;I'm currently working with the unstructured data pipeline in Databricks, using the official notebook provided by Databricks without any modifications. Strangely, despite being an out-of-the-box resource, the notebook fails during execution with the following error:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File &amp;lt;command-1127042695011754&amp;gt;, line 240, in _recursive_character_text_splitter
  File &amp;lt;command-1127042695011754&amp;gt;, line 62, in &amp;lt;lambda&amp;gt;
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 817, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 649, in get_tokenizer_config
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/transformers/utils/hub.py", line 462, in cached_file
    except HFValidationError as e:
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1010, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30d95ded-138f-42e0-83c5-245d1d30255a/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1127, in _hf_hub_download_to_cache_dir
    os.makedirs(os.path.dirname(blob_path), exist_ok=True)
  File "&amp;lt;frozen os&amp;gt;", line 216, in makedirs
  File "&amp;lt;frozen os&amp;gt;", line 216, in makedirs
  File "&amp;lt;frozen os&amp;gt;", line 216, in makedirs
  File "&amp;lt;frozen os&amp;gt;", line 230, in makedirs
OSError: [Errno 30] Read-only file system: '/local_disk0/tmp'

Write not supported
Files in Workspace are read-only from executors. Please consider using Volumes if you need to persist data written from executors.
 &lt;/LI-CODE&gt;&lt;P&gt;The error seems to come from the Hugging Face transformers library trying to download or cache a tokenizer model, but it fails because the executor environment doesn't allow writing to /local_disk0/tmp.&lt;/P&gt;&lt;P&gt;What’s puzzling is that this notebook is supposed to be plug-and-play. Has anyone else encountered this issue? Are there known workarounds or fixes—perhaps involving Volumes or changing the cache directory?&lt;/P&gt;&lt;P&gt;Any help or insight would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thanks,&lt;BR /&gt;Mariano&lt;/P&gt;</description>
      <pubDate>Thu, 23 Oct 2025 19:49:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135897#M1266</guid>
      <dc:creator>Mariano-Vertiz</dc:creator>
      <dc:date>2025-10-23T19:49:05Z</dc:date>
    </item>
    <item>
      <title>Re: Problems with unstructured_data_pipeline</title>
      <link>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135902#M1267</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/193558"&gt;@Mariano-Vertiz&lt;/a&gt;&amp;nbsp;- can you please share the link to the notebook you are trying to run? Thank you!&lt;/P&gt;</description>
      <pubDate>Thu, 23 Oct 2025 21:51:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135902#M1267</guid>
      <dc:creator>dkushari</dc:creator>
      <dc:date>2025-10-23T21:51:19Z</dc:date>
    </item>
    <item>
      <title>Re: Problems with unstructured_data_pipeline</title>
      <link>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135903#M1268</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/38309"&gt;@dkushari&lt;/a&gt;, my mistake! The official notebook is &lt;A href="https://docs.databricks.com/aws/en/notebooks/source/generative-ai/unstructured-data-pipeline.html" target="_blank" rel="noopener"&gt;here&lt;/A&gt;.&lt;BR /&gt;My personal notebook is &lt;A href="https://se-prd-finance-nam.cloud.databricks.com/editor/notebooks/1127042695011735?o=5743593545561267" target="_blank" rel="noopener"&gt;here&lt;/A&gt;.&lt;BR /&gt;&lt;BR /&gt;Thank you for the reply!&lt;/P&gt;&lt;P&gt;Mariano&lt;/P&gt;</description>
      <pubDate>Thu, 23 Oct 2025 22:01:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135903#M1268</guid>
      <dc:creator>Mariano-Vertiz</dc:creator>
      <dc:date>2025-10-23T22:01:07Z</dc:date>
    </item>
    <item>
      <title>Re: Problems with unstructured_data_pipeline</title>
      <link>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135904#M1269</link>
      <description>&lt;P&gt;No worries at all, &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/193558"&gt;@Mariano-Vertiz&lt;/a&gt;. Are you trying to extract information from a bunch of PDFs and query those, or use them as a chatbot?&lt;/P&gt;
&lt;P&gt;If yes, can you look at Agent Bricks - &lt;A href="https://docs.databricks.com/aws/en/generative-ai/agent-bricks/" target="_blank"&gt;https://docs.databricks.com/aws/en/generative-ai/agent-bricks/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Information Extraction - &lt;A href="https://docs.databricks.com/aws/en/generative-ai/agent-bricks/key-info-extraction" target="_blank"&gt;https://docs.databricks.com/aws/en/generative-ai/agent-bricks/key-info-extraction&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Knowledge Assistant - &lt;A href="https://docs.databricks.com/aws/en/generative-ai/agent-bricks/knowledge-assistant" target="_blank"&gt;https://docs.databricks.com/aws/en/generative-ai/agent-bricks/knowledge-assistant&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 23 Oct 2025 22:24:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135904#M1269</guid>
      <dc:creator>dkushari</dc:creator>
      <dc:date>2025-10-23T22:24:31Z</dc:date>
    </item>
    <item>
      <title>Re: Problems with unstructured_data_pipeline</title>
      <link>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135991#M1275</link>
      <description>&lt;P&gt;Yes, both. I am looking to vectorize a bunch of PDFs and then feed them into a Knowledge Assistant. I was told performance would be better if the Knowledge Assistant was fed a vector search index rather than the files directly. Ultimately, this Knowledge Assistant would be part of a multi-agent supervisor.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Oct 2025 19:17:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/135991#M1275</guid>
      <dc:creator>Mariano-Vertiz</dc:creator>
      <dc:date>2025-10-24T19:17:27Z</dc:date>
    </item>
    <item>
      <title>Re: Problems with unstructured_data_pipeline</title>
      <link>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/136002#M1277</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/193558"&gt;@Mariano-Vertiz&lt;/a&gt;&amp;nbsp;- Which access mode are you using for your cluster: dedicated or standard? I think it is failing because a standard cluster does not allow the low-level operation the notebook is trying to perform in cell 42. Is that where it's failing? I ran the notebook end-to-end on a dedicated cluster, and it worked as expected, so please try a dedicated cluster. Here is my run &lt;A href="https://github.com/dipankarkush-db/dbcommunity/blob/main/unstructured-data-pipeline.dbc" target="_self"&gt;as a dbc file&lt;/A&gt; uploaded to a public GitHub repository. Download it and import it into Databricks. The snippet below sets the Transformers cache directory explicitly.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;os.environ['TRANSFORMERS_CACHE'] = '/dbfs/tmp/transformers_cache'&lt;/LI-CODE&gt;
</description>
      <pubDate>Fri, 24 Oct 2025 20:49:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/problems-with-unstructured-data-pipeline/m-p/136002#M1277</guid>
      <dc:creator>dkushari</dc:creator>
      <dc:date>2025-10-24T20:49:46Z</dc:date>
    </item>
  </channel>
</rss>

