Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Vectorisation job automation and errors

brahaman
New Contributor II

Hey there!

I'm fairly new to AI and RAG, and at the moment I'm trying to automatically vectorise documents (.pdf, .txt, etc.) each time a new file lands in a volume that I created.
For that, I created a job that's triggered each time a new file arrives; it runs a suite of tasks, including the vectorisation process.
Because I'm new to this, I chose to use the following notebooks (with minor tweaks to point at my volumes) provided by Databricks:
https://docs.databricks.com/aws/en/notebooks/source/generative-ai/unstructured-data-pipeline.html
Unfortunately, I'm running into a lot of issues with the files that are automatically downloaded from Hugging Face; I suspect the job has limited ability to modify those files.

So my question is: are there other ways to automate this? And are there ways to optimise the pipeline further?

Thanks in advance 😄

1 ACCEPTED SOLUTION


mark_ott
Databricks Employee

To address the question about automating and optimizing document vectorization pipelines (PDF, TXT, etc.), such as the Databricks unstructured data pipeline, and the challenges around Hugging Face model downloads and job flexibility, here are insights and alternative approaches found in recent sources:

Automation and Pipeline Optimization Approaches

  • The Databricks unstructured data pipeline focuses on key pipeline stages: ingestion, preprocessing, parsing, enrichment, deduplication, chunking, embedding, and indexing. Experimenting with chunk sizes, embedding models, filtering, and deduplication improves vector quality and pipeline efficiency.
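The chunking stage mentioned above can be sketched as a simple sliding window with overlap (a minimal illustration, not the Databricks notebook's own implementation; the chunk size and overlap values are arbitrary examples to experiment with):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks for embedding."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Each chunk shares `overlap` characters with its predecessor, so a
# sentence cut at one boundary still appears whole in the next chunk.
chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
```

Varying `chunk_size` and `overlap` and measuring retrieval quality is the cheapest of the experiments listed above.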

  • Alternative tools like Vectorize offer automated vectorization with AI data extraction and optimized RAG evaluation, enabling better embedding and indexing strategies automatically.

  • Turbine provides an automated vector embedding pipeline solution that manages scaling, reading, chunking, embedding, and storing vectors, removing much of the custom implementation complexity from the pipeline.

  • Using cloud native vector services like Azure AI Search with integrated embedding during indexing and Azure Logic Apps for monitoring document uploads can automate ingestion and vectorization without needing heavy job customization.

Alternatives to HuggingFace Model Downloads in Jobs

  • Issues with job constraints when downloading from Hugging Face can be addressed by using alternative model repositories or platforms such as ModelScope, Replicate, TensorFlow Hub, the OpenAI platform, Google Vertex AI, or Amazon SageMaker, which support more flexible or cloud-native model hosting and deployment.

  • Choosing a platform with easier API integrations, caching mechanisms, or better model deployment pipelines helps avoid download issues in automated job runs.

Best Practices for RAG and Pipeline Improvement

  • Modular pipelines with query classification, hybrid retrieval, reranking, and summarization yield better RAG performance.
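Hybrid retrieval usually needs a rank-fusion step to merge keyword and vector results; reciprocal rank fusion (RRF) is a common choice. A minimal sketch (the document IDs and the conventional k=60 constant are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists (e.g. keyword + vector search)
    into one, scoring each doc by the sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc5", "doc3"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# doc1 and doc3 appear in both lists, so they outrank the single-list hits
```

The fused list then feeds the reranking stage mentioned above.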

  • Using smaller embedding models (e.g., 768-dim instead of 1536-dim) can drastically improve speed and storage with minimal quality loss.

  • Efficient metadata extraction, document filtering, and deduplication prevent redundancy and improve indexing.
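Exact-duplicate removal can be done cheaply by hashing normalized chunk text before embedding (a minimal sketch; catching near-duplicates would need something like MinHash instead):

```python
import hashlib


def dedup_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-duplicate chunks (after whitespace/case normalization)
    so identical text is embedded and indexed only once."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


docs = ["Hello  World", "hello world", "Other text"]
unique_docs = dedup_chunks(docs)
# the second entry normalizes to the same text as the first and is dropped
```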


