Hey there!
So I'm fairly new to AI and RAG, and at the moment I'm trying to automatically vectorise documents (.pdf, .txt, etc.) each time a new file lands in a volume that I created.
For that, I created a job that is triggered each time a new file arrives; it runs a series of tasks, including the vectorisation process.
Because I'm new to this, I chose to use the following notebooks (with minor tweaks to point at my volumes) provided by Databricks:
https://docs.databricks.com/aws/en/notebooks/source/generative-ai/unstructured-data-pipeline.html
Unfortunately, I'm running into a lot of issues with the files that are automatically downloaded from HuggingFace. I think it's because the job has limited ability to modify those files?
So my question is: are there other ways to automate this? Are there other ways to optimise the pipeline?
Thanks in advance 😄