Building a Document Store on Databricks

David_K93
Contributor

Hello,

I am somewhat new to Databricks and am trying to build a Q&A application based on a collection of documents. I need to move .pdf and .docx files from my local machine to storage in Databricks and eventually a document store. My questions are:

  1. What is the most effective method for uploading my documents to Databricks storage?
  2. What are the best document stores to use to build this application?

Thank you!

David

Accepted Solution

David_K93
Contributor

Hi all,

I took an initial stab at task one, with some success, using the Databricks CLI. Here are the steps:

  1. Open a Command/Anaconda prompt and run: pip install databricks-cli
  2. In the Databricks console, go to "User Settings" and then "Access Tokens" to generate an access token; save it in a .txt file or somewhere else safe
  3. Back in the command prompt, run: databricks configure --token, then enter your Databricks host (https://<host-name>.cloud.databricks.com/) and the access token you just generated when prompted
  4. You should now have access; test it by listing the contents of the root directory: databricks fs ls dbfs:/
  5. In the command prompt, navigate to the local directory containing the files you want to transfer to Databricks
  6. Make a new directory in DBFS with a command like: databricks fs mkdirs dbfs:/new_dir
  7. Copy the files over recursively: databricks fs cp -r C:/Users/source_dir dbfs:/new_dir

This should put all your files into Databricks.
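The steps above can be collected into a small shell sketch. The host name, token, and the dbfs:/new_dir and C:/Users/source_dir paths are placeholders taken from the post; the environment variables let the legacy CLI authenticate without the interactive configure prompt:

```shell
# Install the (legacy) Databricks CLI
pip install databricks-cli

# The CLI reads these environment variables instead of an interactive
# "databricks configure --token" session. Fill in your own values.
export DATABRICKS_HOST="https://<host-name>.cloud.databricks.com"
export DATABRICKS_TOKEN="<your-access-token>"

# Sanity check: list the DBFS root to confirm access
databricks fs ls dbfs:/

# Create a target directory in DBFS and copy the local documents recursively
databricks fs mkdirs dbfs:/new_dir
databricks fs cp -r C:/Users/source_dir dbfs:/new_dir
```

Note that this is the legacy pip-installed CLI; commands require a live workspace and a valid token, so the placeholders must be replaced before running.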


