<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Vectorisation job automatisation and errors in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/vectorisation-job-automatisation-and-errors/m-p/133663#M1186</link>
    <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;You asked about automating and optimizing a document vectorization pipeline (PDF, TXT, etc.) such as the Databricks unstructured data pipeline, given the problems with HuggingFace model downloads and limited job flexibility. Here are some insights and alternative approaches found in recent sources:&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Automation and Pipeline Optimization Approaches&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The Databricks unstructured data pipeline focuses on key pipeline stages: ingestion, preprocessing, parsing, enrichment, deduplication, chunking, embedding, and indexing. Experimenting with chunk sizes, embedding models, filtering, and deduplication improves vector quality and pipeline efficiency.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Alternative tools like Vectorize offer automated vectorization with AI data extraction and optimized RAG evaluation, enabling better embedding and indexing strategies automatically.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Turbine provides an automated vector embedding pipeline that handles reading, chunking, embedding, and storing vectors at scale, removing much of the complexity of a custom pipeline implementation.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Using cloud-native vector services like Azure AI Search, which can embed documents during indexing, together with Azure Logic Apps to monitor document uploads, can automate ingestion and vectorization without heavy job customization.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Alternatives to HuggingFace Model Downloads in Jobs&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Job restrictions that interfere with downloading models from HuggingFace can be worked around by using alternative model repositories or platforms, such as ModelScope, Replicate, TensorFlow Hub, the OpenAI platform, Google Vertex AI, or Amazon SageMaker, which support more flexible or cloud-native model hosting and deployment.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Choosing a platform with easier API integrations, caching mechanisms, or better model deployment pipelines helps avoid download issues in automated job runs.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Best Practices for RAG and Pipeline Improvement&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Modular pipelines with query classification, hybrid retrieval, reranking, and summarization yield better RAG performance.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Using smaller embedding models (e.g., 768-dim instead of 1536-dim) halves the storage needed per vector and can noticeably speed up similarity search, often with only a small loss in retrieval quality.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Efficient metadata extraction, document filtering, and deduplication prevent redundancy and improve indexing.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Fri, 03 Oct 2025 11:07:54 GMT</pubDate>
    <dc:creator>mark_ott</dc:creator>
    <dc:date>2025-10-03T11:07:54Z</dc:date>
    <item>
      <title>Vectorisation job automatisation and errors</title>
      <link>https://community.databricks.com/t5/generative-ai/vectorisation-job-automatisation-and-errors/m-p/127632#M1087</link>
      <description>&lt;P&gt;Hey there!&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I'm fairly new to AI and RAG, and right now I'm trying to automatically vectorise documents (.pdf, .txt, etc.) each time a new file arrives in a volume that I created.&lt;BR /&gt;For that, I created a job that's triggered each time a new file arrives; it runs a series of tasks, including the vectorisation process.&lt;BR /&gt;Because I'm new to this, I chose to use the following notebooks (with minor tweaks to point at my volumes) provided by Databricks:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/aws/en/notebooks/source/generative-ai/unstructured-data-pipeline.html" target="_blank"&gt;https://docs.databricks.com/aws/en/notebooks/source/generative-ai/unstructured-data-pipeline.html&lt;/A&gt;&lt;BR /&gt;Unfortunately, I'm facing a lot of issues with the files that are automatically downloaded from HuggingFace, I think because the&amp;nbsp;&lt;EM&gt;Job&lt;/EM&gt; doesn't offer many options for modifying those files?&lt;/P&gt;&lt;P&gt;So my question is: are there other ways to automate this? Are there other ways to optimise the pipeline?&lt;/P&gt;&lt;P&gt;Thanks in advance&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 06 Aug 2025 20:36:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/vectorisation-job-automatisation-and-errors/m-p/127632#M1087</guid>
      <dc:creator>brahaman</dc:creator>
      <dc:date>2025-08-06T20:36:55Z</dc:date>
    </item>
    <item>
      <title>Re: Vectorisation job automatisation and errors</title>
      <link>https://community.databricks.com/t5/generative-ai/vectorisation-job-automatisation-and-errors/m-p/133663#M1186</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;You asked about automating and optimizing a document vectorization pipeline (PDF, TXT, etc.) such as the Databricks unstructured data pipeline, given the problems with HuggingFace model downloads and limited job flexibility. Here are some insights and alternative approaches found in recent sources:&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Automation and Pipeline Optimization Approaches&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The Databricks unstructured data pipeline focuses on key pipeline stages: ingestion, preprocessing, parsing, enrichment, deduplication, chunking, embedding, and indexing. Experimenting with chunk sizes, embedding models, filtering, and deduplication improves vector quality and pipeline efficiency.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Alternative tools like Vectorize offer automated vectorization with AI data extraction and optimized RAG evaluation, enabling better embedding and indexing strategies automatically.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Turbine provides an automated vector embedding pipeline that handles reading, chunking, embedding, and storing vectors at scale, removing much of the complexity of a custom pipeline implementation.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Using cloud-native vector services like Azure AI Search, which can embed documents during indexing, together with Azure Logic Apps to monitor document uploads, can automate ingestion and vectorization without heavy job customization.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
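&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The chunk-size and deduplication experiments mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the Databricks notebook: the helper names chunk_text and dedupe_chunks, the character-based windows, and the default sizes are all assumptions made for the example.&lt;/P&gt;

```python
import hashlib


def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into fixed-size character windows that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early so the last window is never just the overlap of the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def dedupe_chunks(chunks):
    """Keep the first copy of each chunk, ignoring case and whitespace differences."""
    seen = set()
    unique = []
    for chunk in chunks:
        normalised = " ".join(chunk.split()).lower()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


# Hypothetical sample input: compare how many chunks each setting produces.
sample = "x" * 230
for size in (50, 100):
    print(size, len(chunk_text(sample, chunk_size=size, overlap=10)))
```

&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Running it with a few chunk_size values on a sample document is a cheap way to see how the chunk count changes before spending compute on embeddings.&lt;/P&gt;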
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Alternatives to HuggingFace Model Downloads in Jobs&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Job restrictions that interfere with downloading models from HuggingFace can be worked around by using alternative model repositories or platforms, such as ModelScope, Replicate, TensorFlow Hub, the OpenAI platform, Google Vertex AI, or Amazon SageMaker, which support more flexible or cloud-native model hosting and deployment.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Choosing a platform with easier API integrations, caching mechanisms, or better model deployment pipelines helps avoid download issues in automated job runs.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
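&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;One way to apply the caching idea above is to stage the model files once in shared storage and have the job read that copy instead of downloading on every run. The sketch below assumes the huggingface_hub package is available and that the job can read the staging folder; the helper names, the folder layout, and any volume path you pass in are hypothetical choices for illustration, not something the Databricks notebook prescribes.&lt;/P&gt;

```python
from pathlib import Path


def local_model_dir(cache_root, repo_id):
    """Map a Hub repo id such as 'org/name' to a folder under cache_root."""
    return Path(cache_root) / repo_id.replace("/", "__")


def ensure_model(cache_root, repo_id):
    """Return a local folder holding the model, downloading only if it is missing."""
    target = local_model_dir(cache_root, repo_id)
    if target.is_dir() and any(target.iterdir()):
        # Already staged by an earlier run: no network access needed.
        return target
    # Imported lazily so runs that only read the staged copy never need the Hub client.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;A one-off interactive run can populate the folder, after which the triggered job calls ensure_model with the same arguments and passes the returned path to its loader; Hugging Face from_pretrained loaders generally accept a local folder in place of a repo id. Setting HF_HUB_OFFLINE=1 in the job's environment should then keep the libraries from contacting the Hub at all.&lt;/P&gt;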
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Best Practices for RAG and Pipeline Improvement&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Modular pipelines with query classification, hybrid retrieval, reranking, and summarization yield better RAG performance.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Using smaller embedding models (e.g., 768-dim instead of 1536-dim) halves the storage needed per vector and can noticeably speed up similarity search, often with only a small loss in retrieval quality.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Efficient metadata extraction, document filtering, and deduplication prevent redundancy and improve indexing.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 03 Oct 2025 11:07:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/vectorisation-job-automatisation-and-errors/m-p/133663#M1186</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-10-03T11:07:54Z</dc:date>
    </item>
  </channel>
</rss>

