Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Issues loading .txt files from DBFS into Langchain TextLoader()

David_K93
Contributor

Hello,

I am building a Langchain QA application in Databricks. I currently have 13 .txt files loaded into DBFS and am trying to read them in iteratively with TextLoader(), pass them to Langchain's RecursiveCharacterTextSplitter() to chunk them, and then add them to a Chroma database. When running this from my local machine there is no problem, but the application does not seem to accept files loaded from DBFS.


I have tried loading the files in as string objects and then passing those to TextLoader(), but that does not work either.

Has anyone found a workaround to this?

1 ACCEPTED SOLUTION

Accepted Solutions

David_K93
Contributor

I ended up tinkering around and found I needed to use the os package to access it as a '/dbfs/' filepath:

import os

# Iterate through the directory of docs: load each file, split it into
# chunks, then add the chunks to the overall list.
# dir_ls is the directory path using the '/dbfs/...' FUSE mount form.
txt_ls = []
for i in os.listdir(dir_ls):
    filename = os.path.join(dir_ls, i)
    loader = TextLoader(filename)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)
    txt_ls.append(texts)
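One detail worth noting: txt_ls.append(texts) produces a list of lists (one list of chunks per source file), so the chunks would typically be flattened before being added to the Chroma database. A minimal sketch of the flattening step, using stand-in strings in place of real Document objects (the commented Chroma call assumes a langchain embeddings object is available):

```python
from itertools import chain

# txt_ls is a list of lists: one list of chunk Documents per source file.
# Plain strings stand in for split_documents() output here.
txt_ls = [["chunk_a1", "chunk_a2"], ["chunk_b1"]]

# Flatten into a single list of chunks before indexing
all_chunks = list(chain.from_iterable(txt_ls))
print(all_chunks)  # ['chunk_a1', 'chunk_a2', 'chunk_b1']

# With real Documents, the flattened list could then be indexed, e.g.:
# db = Chroma.from_documents(all_chunks, embeddings)
```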


3 REPLIES

venkatcrc
New Contributor III

Try the below.

Python file APIs need the '/dbfs' prefix in the path. Since you are using the output of dbutils.fs.ls, the paths will have the 'dbfs:' prefix instead.

Replace loader = TextLoader(i[0]) with loader = TextLoader(i[0].replace('dbfs:', '/dbfs'))
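The prefix swap can be illustrated with plain strings (the path below is a hypothetical example, not from the thread):

```python
# dbutils.fs.ls() returns FileInfo objects whose .path uses the 'dbfs:' URI
# scheme, e.g. 'dbfs:/docs/file01.txt'. Local Python file APIs instead need
# the FUSE mount form '/dbfs/docs/file01.txt'.
spark_path = "dbfs:/docs/file01.txt"  # hypothetical example path
local_path = spark_path.replace("dbfs:", "/dbfs")
print(local_path)  # /dbfs/docs/file01.txt
```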


Anonymous
Not applicable

Hi @David Kersey

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation, and let us know if you need any further assistance!
