
Building a Document Store on Databricks

David_K93
Contributor

Hello,

I'm somewhat new to Databricks and am trying to build a Q&A application on top of a collection of documents. I need to move .pdf and .docx files from my local machine into Databricks storage and, eventually, into a document store. My questions are:

  1. What is the most effective method for uploading my documents to Databricks storage?
  2. What are the best document stores to use to build this application?

Thank you!

David

ACCEPTED SOLUTION

David_K93
Contributor

Hi all,

I took an initial stab at task one with some success using the Databricks CLI. Here are the steps; a consolidated command sequence follows the list:

  1. Open a Command/Anaconda prompt and run: pip install databricks-cli
  2. In the Databricks console, go to "User Settings" and then "Access Tokens" to generate a personal access token. Save it somewhere safe, such as a .txt file.
  3. Back at the command prompt, run: databricks configure --token. When prompted, enter your Databricks host URL (https://<host-name>.cloud.databricks.com/) and the access token you just generated.
  4. Verify that you have access by listing the contents of the DBFS root: databricks fs ls dbfs:/
  5. At the command prompt, navigate to the local directory containing the files you want to transfer.
  6. Create a new directory in DBFS with a command like: databricks fs mkdirs dbfs:/new_dir
  7. Copy the files over recursively: databricks fs cp -r C:/Users/source_dir dbfs:/new_dir

This should put all your files into Databricks.
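For reference, here is the same sequence as a single shell session. This is a minimal sketch: the host URL, the token, and the C:/Users/source_dir and dbfs:/new_dir paths are placeholders carried over from the steps above, so substitute your own values. The final ls is an extra sanity check rather than one of the original steps.

    # Install the Databricks CLI
    pip install databricks-cli

    # Configure authentication; you will be prompted for the host URL and token
    databricks configure --token
    #   Host:  https://<host-name>.cloud.databricks.com/
    #   Token: <the personal access token from User Settings>

    # Sanity check: list the DBFS root
    databricks fs ls dbfs:/

    # Create a target directory in DBFS and copy the local files recursively
    databricks fs mkdirs dbfs:/new_dir
    databricks fs cp -r C:/Users/source_dir dbfs:/new_dir

    # Confirm the upload landed
    databricks fs ls dbfs:/new_dir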


