Databricks

ML2022 · ‎02-29-2024

Hello, everyone!

I'm currently working on building a Retrieval-Augmented Generation (RAG) system, where the goal is to extract text from various document types (including PDFs, Excel files, HTML pages, text documents, Docs, PPTs, and notably, Keynote presentations and MP4 video files). I plan to convert this extracted text into embeddings and store them in a vector database for efficient retrieval.

For most document types, I've found robust Python libraries that serve the purpose well. For instance, libraries like PyMuPDF for PDFs and pandas for Excel files have been quite helpful. However, I'm encountering challenges with extracting data from Keynote files and MP4 files. Specifically:

Keynote Files: I'm looking for a reliable method or library that can help me extract textual content from Keynote presentations. Given the proprietary nature of the Keynote format, I'm not sure of the best approach to access and process these files programmatically.
MP4 Files: My objective here is to extract spoken text from video files. I'm aware of the general approach involving speech-to-text technologies but am seeking recommendations for specific libraries or APIs that can efficiently process MP4 files to extract accurate transcriptions.

The extracted text from these various sources will be crucial in building a comprehensive dataset for my RAG system, aimed at improving the relevance and accuracy of generated content based on a query.

If anyone has experience or suggestions on extracting text from Keynote and MP4 files, or if you have worked on similar RAG systems and can offer insights, I would greatly appreciate your advice. Additionally, any tips on processing these files at scale or integrating them into a vector database would be incredibly helpful.

Thank you in advance for your help and suggestions!

Kaniz · ‎03-15-2024

Hi @ML2022, Building a Retrieval-Augmented Generation (RAG) system sounds like an exciting project!

Let’s dive into your challenges with extracting text from Keynote files and MP4 video files:

Keynote Files:
- One way to get at the contents of a Keynote presentation is to treat the file as a zip archive. Here’s how:
  1. Locate the Keynote presentation (or PowerPoint).
  2. Make a copy of the presentation file (right-click and select “Duplicate”).
  3. Rename the copy so that its extension is .zip. If you see a message about using .zip, click “Use .zip”.
  4. Double-click the renamed file to unzip it.
  5. The videos, sound files, or documents are found in the “Data” folder within the unzipped folder ¹. Note that this folder holds the embedded media rather than the slide text itself.
- Alternatively, if the presentation was saved as a package, you can look inside it directly. Right-click on the Keynote document and look for the option called “Show Package Contents”. This will reveal a folder with all the documents embedded in the Keynote file ¹.
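
The manual steps above can also be scripted. Here is a minimal sketch, assuming the .key file is either a single-file zip archive or a package directory; the function names `list_keynote_entries` and `media_entries` are made up for illustration. It only lists what is inside the file. As far as I know, the slide text itself sits in compressed binary files in the archive rather than in plain text, so this is a starting point for inspection, not a full text extractor.

```python
import zipfile
from pathlib import Path


def list_keynote_entries(path):
    """Return the relative paths of all files inside a Keynote presentation.

    Assumption: the presentation is either a zip archive or a package
    directory (the two cases described in the steps above).
    """
    p = Path(path)
    if p.is_dir():
        # Package-style presentation: just walk the directory tree.
        return sorted(
            f.relative_to(p).as_posix() for f in p.rglob("*") if f.is_file()
        )
    if zipfile.is_zipfile(p):
        # Single-file presentation: read the zip table of contents.
        with zipfile.ZipFile(p) as zf:
            return sorted(n for n in zf.namelist() if not n.endswith("/"))
    raise ValueError(f"{p} is neither a package directory nor a zip archive")


def media_entries(entries):
    """Keep only entries under the "Data" folder (embedded media)."""
    return [e for e in entries if e.startswith("Data/")]
```

Calling `media_entries(list_keynote_entries("deck.key"))` on a presentation (the file name here is a placeholder) would give you the embedded videos, sounds, and documents without going through Finder.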
MP4 Files:
- To extract spoken text from video files (MP4), you can use speech-to-text technologies. Here are some steps:
  - Extract the audio data from the video file. You can use tools like FFmpeg to convert the video file to an audio file.
  - Make the audio available to the service: upload it to a Cloud Storage bucket, or, if you’re working with a local file, convert it to base64-encoded data.
  - Finally, send a transcription request to a speech-to-text service like Google Cloud Speech-to-Text. Specify the source of the original audio for better results ².
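
To make the steps above concrete, here is a rough sketch of the local-file route. It assumes FFmpeg is installed and on your PATH, and that you want 16 kHz mono 16-bit WAV audio (a common choice for speech recognition, but an assumption on my part). The helper names are made up for illustration. The request body follows the JSON shape of the Google Cloud Speech-to-Text `speech:recognize` REST call; authentication and the actual HTTP call are left out.

```python
import base64
import subprocess


def ffmpeg_audio_command(video_path, audio_path, sample_rate=16000):
    """Build the FFmpeg command that drops the video stream and writes WAV."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(video_path),    # input MP4
        "-vn",                    # no video in the output
        "-ac", "1",               # mono
        "-ar", str(sample_rate),  # sample rate in Hz
        "-c:a", "pcm_s16le",      # 16-bit PCM, i.e. LINEAR16
        str(audio_path),
    ]


def extract_audio(video_path, audio_path, sample_rate=16000):
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(
        ffmpeg_audio_command(video_path, audio_path, sample_rate), check=True
    )


def build_recognize_request(audio_bytes, language_code="en-US", sample_rate=16000):
    """Build the JSON body for a synchronous transcription request."""
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate,
            "languageCode": language_code,
        },
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }
```

For longer recordings, inline base64 content is usually not practical; in that case the Cloud Storage route from the steps above is likely the better fit.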
Integration with Vector Database:
- Once you have the extracted text, generate embeddings for it with an embedding model and store them in a vector index for efficient retrieval. Tools like Annoy, Faiss, or Elasticsearch can index and search the embeddings; they do not create the embeddings themselves.
- Ensure that the vector database supports fast similarity search to retrieve relevant content based on queries.
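
As an illustration of the embed, index, and search flow, here is a small self-contained sketch. Everything in it is a stand-in: `toy_embed` is a hashed bag-of-words vector and not a real embedding model, and the brute-force list scan takes the place of an index such as Faiss or Annoy. It is only meant to show how the pieces fit together.

```python
import hashlib
import math

DIM = 256  # arbitrary vector size for the toy embedding


def toy_embed(text, dim=DIM):
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(chunks):
    """Pair every text chunk with its vector (a real system would use a vector DB)."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def search(index, query, k=3):
    """Return the k chunks with the highest cosine similarity to the query."""
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In a real pipeline you would swap `toy_embed` for your embedding model and `build_index`/`search` for the vector store's own add and query calls; the text chunks would come from the Keynote, MP4, and other extractors.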

Remember to handle these processes at scale efficiently, especially if you’re dealing with a large number of files. Good luck with your RAG system, and feel free to reach out if you need further assistance! 🚀📊🔍


Seeking Advice: Extracting Text from Keynote and MP4 Files for RAG Implementation
