Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Seeking Advice: Extracting Text from Keynote and MP4 Files for RAG Implementation

ML2022
New Contributor III

Hello, everyone!

I'm currently working on building a Retrieval-Augmented Generation (RAG) system, where the goal is to extract text from various document types (including PDFs, Excel files, HTML pages, text documents, Docs, PPTs, and notably, Keynote presentations and MP4 video files). I plan to convert this extracted text into embeddings and store them in a vector database for efficient retrieval.
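
To make the pipeline concrete, here is a minimal sketch of the ingestion step: route each file to an extractor by extension, then split the text into overlapping chunks ready for embedding. All names (`EXTRACTORS`, `extract_text`, `chunk_text`) and the chunk sizes are hypothetical and only for illustration; only a plain-text extractor is registered, and the embedding and vector database calls are left out because they depend on the stack in use.

```python
from pathlib import Path
from typing import Callable, Dict, List


def extract_plain_text(path: Path) -> str:
    """Read a text file, replacing bytes that are not valid UTF-8."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping file extensions to extractor functions.
# Extractors for PDF, Excel, Keynote, MP4, etc. would be registered here.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".txt": extract_plain_text,
}


def extract_text(path: Path) -> str:
    """Dispatch to the extractor registered for the file's extension."""
    suffix = path.suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return EXTRACTORS[suffix](path)


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character chunks of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

Chunking by characters is the simplest choice for a sketch; splitting on sentences or tokens may suit a given embedding model better.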

For most document types, I've found robust Python libraries that serve the purpose well. For instance, libraries like PyMuPDF for PDFs and pandas for Excel files have been quite helpful. However, I'm encountering challenges with extracting data from Keynote files and MP4 files. Specifically:

  • Keynote Files: I'm looking for a reliable method or library that can help me extract textual content from Keynote presentations. Given the proprietary nature of the Keynote format, I'm not sure of the best approach to access and process these files programmatically.

  • MP4 Files: My objective here is to extract spoken text from video files. I'm aware of the general approach involving speech-to-text technologies but am seeking recommendations for specific libraries or APIs that can efficiently process MP4 files to extract accurate transcriptions.
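
For the MP4 case, the general approach I'm aware of starts by pulling the audio track out of the video so it can be handed to a speech-to-text model. Below is a minimal sketch of that first step only, assuming the `ffmpeg` command-line tool is installed and on the `PATH`. The function names are hypothetical, the 16 kHz mono WAV output is an assumption about what the downstream model expects, and the transcription call itself is deliberately left out because it depends on the library or API chosen.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(video: Path, audio: Path, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes the audio track as mono PCM WAV."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(video),        # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # downmix to a single (mono) channel
        "-ar", str(sample_rate), # resample to the requested rate
        "-c:a", "pcm_s16le",     # 16-bit little-endian PCM audio
        str(audio),
    ]


def extract_audio(video: Path, audio: Path) -> Path:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video, audio), check=True, capture_output=True)
    return audio
```

Calling `extract_audio(Path("talk.mp4"), Path("talk.wav"))` would produce a WAV file that a speech-to-text library or API could then transcribe.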

The text extracted from these sources will form the dataset for my RAG system, which aims to improve the relevance and accuracy of the content generated in response to a query.

If anyone has experience or suggestions on extracting text from Keynote and MP4 files, or if you have worked on similar RAG systems and can offer insights, I would greatly appreciate your advice. Additionally, any tips on processing these files at scale or integrating them into a vector database would be incredibly helpful.

Thank you in advance for your help and suggestions!

0 REPLIES
