02-08-2024 02:16 AM
Context:
IDE: IntelliJ 2023.3.2
Library: databricks-connect 13.3
Python: 3.10
Description:
I develop notebooks and Python scripts locally in the IDE and connect to the Spark cluster via databricks-connect for a better developer experience.
I download a file from the public internet and want to store it in an external Unity Catalog Volume (hosted on S3). I would like to upload the file using a Volume path rather than uploading it directly to S3 with AWS credentials.
Everything works fine using a Databricks Notebook:
E.g.:
dbutils.fs.cp("<local/file/path>", "/Volumes/<path>")
or:
source_file = ...
with open("/Volumes/<path>", 'wb') as destination_file:
    destination_file.write(source_file)
I can't figure out a way to do that in my IDE locally.
Using dbutils:
dbutils.fs.cp("file:/<local/path>", "/Volumes/<path>")
I get the error:
databricks.sdk.errors.mapping.InvalidParameterValue: Path must be absolute: \Volumes\<path>
Using Python's with statement doesn't work either, because the Unity Catalog Volume is not mounted on my local machine.
Is there a way to upload files from the local machine or memory into Unity Catalog Volumes?
02-12-2024 01:52 AM
Thanks for your answer. But I want to upload the files/data programmatically, not manually through the Databricks UI.
04-02-2024 11:20 AM
Late to the discussion, but I too was looking for a way to do this _programmatically_, as opposed to the UI.
The solution I landed on was using the Python SDK (though you could assuredly do this using an API request instead if you're not in Python):
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.files.upload('/your/volume/path/foo.txt', 'foo bar')
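If the payload is an actual local file rather than an in-memory string, a minimal sketch along these lines should also work (assuming the client picks up your host and token from environment variables or ~/.databrickscfg; the paths below are just placeholders):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # authenticates from env vars or ~/.databrickscfg
# Stream the local file as binary content to the Volume path
with open('/local/path/report.csv', 'rb') as f:
    w.files.upload('/Volumes/<catalog>/<schema>/<volume>/report.csv', f, overwrite=True)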
05-06-2024 02:30 AM
Thanks, that's what I was looking for.
Although it would be nice not to have to read the file contents myself and instead just provide the path of the file to upload.
05-11-2024 11:59 AM
Hey Husky,
You can provide just the path of the file to upload with a REST API call: https://docs.databricks.com/api/workspace/files/upload. It's in Public Preview. Please see below.
import re
import requests

def return_ws_url():
    # Workspace hostname from the notebook context, e.g. Some(<host>) -> <host>
    workspace_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("browserHostName")
    match = re.match(r'Some\((.*)\)', str(workspace_url))
    if match:
        return match.group(1)
    else:
        print("No value found")

# Bearer token taken from the notebook context
headers = {'Authorization': 'Bearer {}'.format(dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get())}

def upload_ws_file_to_volume(local_path, remote_path):
    # PUT the file contents to the Files API (Public Preview)
    with open(local_path, 'rb') as f:
        r = requests.put(
            'https://{databricks_instance}/api/2.0/fs/files{path}'.format(
                databricks_instance=return_ws_url(), path=remote_path),
            headers=headers,
            data=f)
        r.raise_for_status()

upload_ws_file_to_volume(<<Your source file local path>>, <<UC Volume path>>)
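Since the original question was about uploading from a local IDE, where the dbutils notebook context isn't available, a rough variant of the same call that reads the host and token from environment variables could look like this. DATABRICKS_HOST and DATABRICKS_TOKEN here are just assumed names for your own configuration, and the paths are placeholders:

import os
import requests

def upload_local_file_to_volume(local_path, remote_path):
    # Same Files API endpoint, but host and token come from the environment
    host = os.environ['DATABRICKS_HOST']    # workspace hostname, without the https:// prefix
    token = os.environ['DATABRICKS_TOKEN']  # personal access token
    with open(local_path, 'rb') as f:
        r = requests.put(
            'https://{host}/api/2.0/fs/files{path}'.format(host=host, path=remote_path),
            headers={'Authorization': 'Bearer {}'.format(token)},
            data=f)
    r.raise_for_status()

upload_local_file_to_volume('/tmp/data.csv', '/Volumes/<catalog>/<schema>/<volume>/data.csv')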