Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Vectorisation job automation and errors

brahaman
New Contributor II

Hey there!

I'm fairly new to AI and RAG, and at the moment I'm trying to automatically vectorise documents (.pdf, .txt, etc.) each time a new file lands in a volume that I created.
For that, I created a job that's triggered each time a new file arrives; it runs a suite of tasks, including the vectorisation process.
Because I'm new to this, I chose to use the following notebooks (with minor tweaks to point at my volumes) provided by Databricks:
https://docs.databricks.com/aws/en/notebooks/source/generative-ai/unstructured-data-pipeline.html
Unfortunately, I'm running into a lot of issues with the files that are automatically downloaded from Hugging Face; I suspect the job has limited ability to modify those files.

So my question is: are there other ways to automate this? And are there ways to optimise the pipeline further?

Thanks in advance 😄

1 ACCEPTED SOLUTION


mark_ott
Databricks Employee

To address the question about automating and optimizing document vectorization pipelines (PDF, TXT, etc.), such as the Databricks unstructured data pipeline, and the challenges around Hugging Face model downloads and job flexibility, here are insights and alternative approaches found in recent sources:

Automation and Pipeline Optimization Approaches

  • The Databricks unstructured data pipeline focuses on key pipeline stages: ingestion, preprocessing, parsing, enrichment, deduplication, chunking, embedding, and indexing. Experimenting with chunk sizes, embedding models, filtering, and deduplication improves vector quality and pipeline efficiency.
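The chunking stage mentioned above can be sketched as a simple sliding window with overlap (a minimal illustration, not the Databricks notebook's own implementation; the chunk size and overlap values are arbitrary examples to experiment with):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks for embedding."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Each chunk shares `overlap` characters with its predecessor, so a
# sentence cut at one boundary still appears whole in the next chunk.
chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
```

Varying `chunk_size` and `overlap` and measuring retrieval quality is the cheapest of the experiments listed above.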

  • Alternative tools like Vectorize offer automated vectorization with AI data extraction and optimized RAG evaluation, enabling better embedding and indexing strategies automatically.

  • Turbine provides an automated vector embedding pipeline solution that manages scaling, reading, chunking, embedding, and storing vectors, removing much of the custom implementation complexity from the pipeline.

  • Using cloud native vector services like Azure AI Search with integrated embedding during indexing and Azure Logic Apps for monitoring document uploads can automate ingestion and vectorization without needing heavy job customization.

Alternatives to HuggingFace Model Downloads in Jobs

  • Issues with job constraints when downloading from Hugging Face can be addressed by using alternative model repositories or platforms such as ModelScope, Replicate, TensorFlow Hub, the OpenAI platform, Google Vertex AI, or Amazon SageMaker, which support more flexible or cloud-native model hosting and deployment.

  • Choosing a platform with easier API integrations, caching mechanisms, or better model deployment pipelines helps avoid download issues in automated job runs.

Best Practices for RAG and Pipeline Improvement

  • Modular pipelines with query classification, hybrid retrieval, reranking, and summarization yield better RAG performance.
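Hybrid retrieval usually needs a rank-fusion step to merge keyword and vector results; reciprocal rank fusion (RRF) is a common choice. A minimal sketch (the document IDs and the conventional k=60 constant are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists (e.g. keyword + vector search)
    into one, scoring each doc by the sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc5", "doc3"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# doc1 and doc3 appear in both lists, so they outrank the single-list hits
```

The fused list then feeds the reranking stage mentioned above.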

  • Using smaller embedding models (e.g., 768-dim instead of 1536-dim) can drastically improve speed and storage with minimal quality loss.

  • Efficient metadata extraction, document filtering, and deduplication prevent redundancy and improve indexing.
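Exact-duplicate removal can be done cheaply by hashing normalized chunk text before embedding (a minimal sketch; catching near-duplicates would need something like MinHash instead):

```python
import hashlib


def dedup_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-duplicate chunks (after whitespace/case normalization)
    so identical text is embedded and indexed only once."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


docs = ["Hello  World", "hello world", "Other text"]
unique_docs = dedup_chunks(docs)
# the second entry normalizes to the same text as the first and is dropped
```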


