Upload file from local file system to Unity Catalog Volume (via databricks-connect)

Husky
New Contributor III

Context:
IDE: IntelliJ 2023.3.2
Library: databricks-connect 13.3
Python: 3.10

Description:
I develop notebooks and python scripts locally in the IDE and I connect to the spark cluster via databricks-connect for a better developer experience.  

I download a file from the public internet and I want to store it in an external Unity Catalog Volume (hosted on S3). I would like to upload the file using a volume path and not directly uploading it to S3 via AWS Credentials.

Everything works fine using a Databricks Notebook:
E.g.:

 

dbutils.fs.cp("<local/file/path>", "/Volumes/<path>")

 

or:

 

source_file = ...
with open("/Volumes/<path>", 'wb') as destination_file:
    destination_file.write(source_file)

 

I can't figure out a way to do that in my IDE locally. 
Using dbutils:

 

dbutils.fs.cp("file:/<local/path>", "/Volumes/<path>")

 

I get the error:

 

databricks.sdk.errors.mapping.InvalidParameterValue: Path must be absolute: \Volumes\<path>

 

Using python's with statement won't work, because the Unity Catalog Volume is not mounted to my local machine.

Is there a way to upload files from the local machine or memory into Unity Catalog Volumes?

Husky
New Contributor III

Thanks for your answer. But I want to upload the files/data programmatically and not manually with the Databricks UI.

lathaniel
New Contributor III

Late to the discussion, but I too was looking for a way to do this _programmatically_, as opposed to the UI.

The solution I landed on was using the Python SDK (though you could assuredly do this using an API request instead if you're not in Python):

w = WorkspaceClient()
w.files.upload('/your/volume/path/foo.txt', 'foo bar')

View solution in original post

Husky
New Contributor III

Thanks, that's what I was looking for.

Even though it would be nice to not read the binary but to provide just the path to the file to upload.

dkushari
Databricks Employee
Databricks Employee

Hey Husky,

You can provide just the path to the file to upload with REST Api call. https://docs.databricks.com/api/workspace/files/upload. Its in Public Preview. Please see below.

def return_ws_url():
    workspace_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("browserHostName")
    match = re.match(r'Some\((.*)\)', str(workspace_url))
    if match:
      value = match.group(1)
      return(value)
    else:
        print("No value found")

def upload_ws_file_to_volume(local_path, remote_path):
  with open(local_path, 'rb') as f:
    r = requests.put(
      'https://{databricks_instance}/api/2.0/fs/files{path}'.format(
        databricks_instance=return_ws_url(), path=remote_path),
      headers=headers,
      data=f)
    r.raise_for_status()

headers = {'Authorization' : 'Bearer {}'.format(dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get())}
print(headers)

upload_ws_file_to_volume(<<Your source file local path>>, <<UC Volume path>>)