Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Building a Document Store on Databricks

David_K93
Contributor

Hello,

I am somewhat new to Databricks and am trying to build a Q&A application based on a collection of documents. I need to move .pdf and .docx files from my local machine to storage in Databricks and eventually a document store. My questions are:

  1. What is the most effective method for uploading my documents to Databricks storage?
  2. What are the best document stores to use to build this application?

Thank you!

David

1 ACCEPTED SOLUTION

David_K93
Contributor

Hi all,

I took an initial stab at the first task with some success using the Databricks CLI. Here are the steps:

  1. Open a Command/Anaconda prompt and run: pip install databricks-cli
  2. In the Databricks console, go to "User Settings" and then "Access Tokens" to generate a personal access token; save it in a .txt file or somewhere else safe.
  3. Back in the prompt, run: databricks configure --token — then enter your Databricks host URL (https://<host-name>.cloud.databricks.com/) and the access token you just generated when prompted.
  4. Verify that you have access by listing the contents of the DBFS root: databricks fs ls dbfs:/
  5. In the prompt, navigate to the local directory containing the files you want to transfer.
  6. Create a new directory in DBFS, for example: databricks fs mkdirs dbfs:/new_dir
  7. Copy the files over recursively: databricks fs cp -r C:/Users/source_dir dbfs:/new_dir

This should put all of your files into DBFS, where they can be read from notebooks.
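For reference, the whole sequence condenses into a short shell session. The host name, target directory, and source path below are placeholders from the steps above — substitute your own, and note that `databricks configure --token` prompts interactively for the host URL and token:

```shell
# Install the Databricks CLI
pip install databricks-cli

# Configure it with your workspace URL and personal access token
# (both are entered at the interactive prompts)
databricks configure --token

# Sanity check: list the DBFS root to confirm the CLI is authenticated
databricks fs ls dbfs:/

# Create a target directory in DBFS and copy the local documents recursively
databricks fs mkdirs dbfs:/new_dir
databricks fs cp -r C:/Users/source_dir dbfs:/new_dir
```

These commands require a live workspace and valid credentials, so run them from a machine that can reach your Databricks host.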

