Upload file from local file system to Unity Catalog Volume (via databricks-connect)

Husky
New Contributor III

Context:
IDE: IntelliJ 2023.3.2
Library: databricks-connect 13.3
Python: 3.10

Description:
I develop notebooks and Python scripts locally in my IDE and connect to the Spark cluster via databricks-connect for a better developer experience.

I download a file from the public internet and want to store it in an external Unity Catalog volume (hosted on S3). I would like to upload the file using a volume path rather than uploading it directly to S3 with AWS credentials.

Everything works fine using a Databricks notebook, e.g.:

dbutils.fs.cp("<local/file/path>", "/Volumes/<path>")

or:

source_file = ...
with open("/Volumes/<path>", 'wb') as destination_file:
    destination_file.write(source_file)

I can't figure out a way to do that locally from my IDE.
Using dbutils:

dbutils.fs.cp("file:/<local/path>", "/Volumes/<path>")

I get the error:

databricks.sdk.errors.mapping.InvalidParameterValue: Path must be absolute: \Volumes\<path>

Using Python's with statement won't work, because the Unity Catalog volume is not mounted on my local machine.

Is there a way to upload files from the local machine or memory into Unity Catalog Volumes?


Kaniz
Community Manager

Hi @Husky, you can upload files from your local machine or memory into Unity Catalog volumes in Databricks.

Here are the steps to achieve this:

Ensure Prerequisites: Before you proceed, make sure you have the following:

  • A Databricks workspace with Unity Catalog enabled. If you haven’t set up Unity Catalog yet, refer to the documentation on getting started with Unity Catalog.
  • The necessary privileges (a sketch of the corresponding GRANT statements follows this list):
    • WRITE VOLUME privilege on the target volume where you want to upload files.
    • USE SCHEMA privilege on the parent schema.
    • USE CATALOG privilege on the parent catalog.
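
As referenced in the privileges bullet above, here is a minimal sketch of the corresponding GRANT statements, assuming a Spark session is available and using placeholder catalog/schema/volume names and a placeholder principal:

# Placeholder object names and principal; run as a user allowed to grant these privileges.
spark.sql("GRANT USE CATALOG ON CATALOG <catalog> TO `user@example.com`")
spark.sql("GRANT USE SCHEMA ON SCHEMA <catalog>.<schema> TO `user@example.com`")
spark.sql("GRANT WRITE VOLUME ON VOLUME <catalog>.<schema>.<volume> TO `user@example.com`")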

Upload Files to Volume: Follow these steps to upload files to a Unity Catalog volume:

  • In your Databricks workspace, click New > Add Data.
  • Select Upload Files to Volume.
  • Choose a volume or a directory inside a volume, or paste a volume path.
  • Click the browse button or drag and drop files directly into the drop zone.

Additional Notes:

  • For semi-structured or structured files, you can use Auto Loader or COPY INTO to create tables from the uploaded files (see the sketch at the end of this reply).
  • You can also run various machine learning and data science workloads on files within the volume.
  • Additionally, you can upload libraries, certificates, and other configuration files of arbitrary formats (e.g., .whl or .txt) that you want to use for configuring cluster libraries, notebook-scoped libraries, or job dependencies.

Remember that volumes are supported in Databricks Runtime 13.2 and above. If you encounter any issues, ensure you’re using a compatible runtime version. 
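
Regarding the COPY INTO note above, a minimal sketch of loading an uploaded CSV file from a volume into a table, assuming a Spark session is available and using placeholder catalog, schema, volume, table, and file names:

# Placeholder names throughout; adjust to your catalog, schema, volume, and file format.
spark.sql("CREATE TABLE IF NOT EXISTS <catalog>.<schema>.my_table")  # schemaless target works with mergeSchema
spark.sql("""
  COPY INTO <catalog>.<schema>.my_table
  FROM '/Volumes/<catalog>/<schema>/<volume>/data.csv'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")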

Husky
New Contributor III

Thanks for your answer, but I want to upload the files/data programmatically, not manually through the Databricks UI.

lathaniel
New Contributor III (Accepted Solution)

Late to the discussion, but I too was looking for a way to do this _programmatically_, as opposed to the UI.

The solution I landed on was using the Python SDK (though you could assuredly do this using an API request instead if you're not in Python):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.files.upload('/your/volume/path/foo.txt', 'foo bar')
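
To upload an actual local file rather than an in-memory string, the same call should also accept an open binary file object; a minimal sketch, assuming the client picks up host/token from your local Databricks config or environment and using placeholder paths:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
# Stream a local file into the volume path (both paths are placeholders)
with open("/local/path/foo.txt", "rb") as f:
    w.files.upload("/Volumes/<catalog>/<schema>/<volume>/foo.txt", f, overwrite=True)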

Husky
New Contributor III

Thanks, that's what I was looking for.

Though it would be nice to be able to just provide the path of the file to upload instead of reading its bytes first.

dkushari
New Contributor III

Hey Husky,

You can provide just the path to the file to upload with a REST API call: https://docs.databricks.com/api/workspace/files/upload. It's in Public Preview. Please see below.

import re
import requests

# Resolve the workspace URL from the notebook context
def return_ws_url():
    workspace_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("browserHostName")
    match = re.match(r'Some\((.*)\)', str(workspace_url))
    if match:
        return match.group(1)
    else:
        print("No value found")

# PUT the local file to the Files API endpoint for the volume path
def upload_ws_file_to_volume(local_path, remote_path):
    with open(local_path, 'rb') as f:
        r = requests.put(
            'https://{databricks_instance}/api/2.0/fs/files{path}'.format(
                databricks_instance=return_ws_url(), path=remote_path),
            headers=headers,
            data=f)
        r.raise_for_status()

# Authenticate with the notebook's API token
headers = {'Authorization': 'Bearer {}'.format(dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get())}

upload_ws_file_to_volume(<<Your source file local path>>, <<UC Volume path>>)
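
Note that the snippet above resolves the workspace URL and API token from the notebook context (dbutils), so it only works inside a Databricks notebook. When running locally (e.g., via databricks-connect), the same PUT can be issued by supplying the workspace host and a personal access token directly; a minimal sketch, assuming for illustration that the environment variables DATABRICKS_HOST and DATABRICKS_TOKEN hold those values:

import os
import requests

def upload_local_file_to_volume(local_path, remote_path):
    # Workspace host (e.g. https://<workspace>.cloud.databricks.com) and a PAT,
    # taken here from environment variables (an assumption for this sketch).
    host = os.environ["DATABRICKS_HOST"].rstrip("/")
    token = os.environ["DATABRICKS_TOKEN"]
    with open(local_path, "rb") as f:
        r = requests.put(
            f"{host}/api/2.0/fs/files{remote_path}",
            headers={"Authorization": f"Bearer {token}"},
            data=f,
        )
    r.raise_for_status()

upload_local_file_to_volume("/local/path/foo.txt", "/Volumes/<catalog>/<schema>/<volume>/foo.txt")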

 
