Hello, everyone!
I'm currently working on building a Retrieval-Augmented Generation (RAG) system, where the goal is to extract text from various document types (including PDFs, Excel files, HTML pages, text documents, Docs, PPTs, and notably, Keynote presentations and MP4 video files). I plan to convert this extracted text into embeddings and store them in a vector database for efficient retrieval.
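To make the ingestion step concrete, here is a minimal sketch of how files might be routed to a per-type extractor by extension. The `EXTRACTORS` mapping, its labels, and the `pick_extractor` function are hypothetical names chosen for illustration, not part of any library; in a real pipeline each label would point at an actual extraction function.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to an extractor label.
# In practice each value would be a callable that returns the file's text.
EXTRACTORS = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".html": "html",
    ".txt": "text",
    ".docx": "word",
    ".pptx": "powerpoint",
    ".key": "keynote",
    ".mp4": "video",
}


def pick_extractor(filename: str) -> str:
    """Return the extractor label for a file, based on its suffix."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        # Fail loudly so unsupported formats are noticed during ingestion.
        raise ValueError(f"no extractor registered for {suffix!r}") from None
```

With a table like this, Keynote and MP4 support comes down to filling in the two entries that currently have no good extractor behind them.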
For most document types, I've found robust Python libraries that serve the purpose well. For instance, libraries like PyMuPDF for PDFs and pandas for Excel files have been quite helpful. However, I'm encountering challenges with extracting data from Keynote files and MP4 files. Specifically:
Keynote Files: I'm looking for a reliable method or library that can help me extract textual content from Keynote presentations. Given the proprietary nature of the Keynote format, I'm not sure of the best approach to access and process these files programmatically.
MP4 Files: My objective here is to extract spoken text from video files. I'm aware of the general approach involving speech-to-text technologies but am seeking recommendations for specific libraries or APIs that can efficiently process MP4 files to extract accurate transcriptions.
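For the MP4 case, the usual first step is pulling the audio track out of the video before any speech-to-text engine sees it. The sketch below assumes the `ffmpeg` command-line tool is installed and on the PATH; `build_ffmpeg_command` and `extract_audio` are hypothetical helper names, and the 16 kHz mono WAV output is an assumption about what the downstream transcription tool expects.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(video_path: str, wav_path: str) -> List[str]:
    """Build an ffmpeg invocation that writes the audio track as mono 16 kHz WAV."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",           # drop the video stream, keep audio only
        "-ac", "1",      # downmix to one channel (mono)
        "-ar", "16000",  # resample to 16 kHz
        wav_path,
    ]


def extract_audio(video_path: str) -> str:
    """Run ffmpeg and return the path of the WAV file written next to the video."""
    wav_path = str(Path(video_path).with_suffix(".wav"))
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
    return wav_path
```

The resulting WAV file would then be handed to whichever transcription library or API ends up being chosen; that part is exactly what I'm asking about.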
The text extracted from these sources will form the dataset behind my RAG system, which is meant to improve the relevance and accuracy of the content generated in response to a query.
If anyone has experience or suggestions on extracting text from Keynote and MP4 files, or if you have worked on similar RAG systems and can offer insights, I would greatly appreciate your advice. Additionally, any tips on processing these files at scale or integrating them into a vector database would be incredibly helpful.
Thank you in advance for your help and suggestions!