Unable to open a file in dbfs. Trying to move files from Google Bucket to Azure Blob Storage

editter
New Contributor II

Background:

I am attempting to download the Google Cloud SDK on Databricks. The end goal is to use the SDK to transfer files from a Google Cloud Storage bucket to Azure Blob Storage using Databricks. (If you have any other ideas for this transfer, please feel free to share. I do not want to use Azure Data Factory.)

I also have Unity Catalog enabled if that makes a difference.

Right now, I am first attempting to extract the Google Cloud SDK archive in DBFS after moving it to the following location. I know the file exists here:

%fs
ls dbfs:/tmp/google_sdk

Returns:
dbfs:/tmp/google_sdk/google_cloud_sdk_352_0_0_linux_x86_64_tar.gz

I have tried the following to open the file with tarfile. None have worked:

tar = tarfile.open('dbfs:/tmp/google_sdk/google_cloud_sdk_352_0_0_linux_x86_64_tar.gz', mode="r|gz")

tar = tarfile.open('/dbfs/tmp/google_sdk/google_cloud_sdk_352_0_0_linux_x86_64_tar.gz', mode="r|gz")

tar = tarfile.open('/tmp/google_sdk/google_cloud_sdk_352_0_0_linux_x86_64_tar.gz', mode="r|gz")

tar = tarfile.open('/dbfs/dbfs/tmp/google_sdk/google_cloud_sdk_352_0_0_linux_x86_64_tar.gz', mode="r|gz")

tar = tarfile.open('dbfs/tmp/google_sdk/google_cloud_sdk_352_0_0_linux_x86_64_tar.gz', mode="r|gz")

tar = tarfile.open('tmp/google_sdk/google_cloud_sdk_352_0_0_linux_x86_64_tar.gz', mode="r|gz")

All of them return an error that no such file or directory exists, but I know the file is there. What am I missing here? Why am I not able to open this file?

Thanks for any help!

2 REPLIES

Kaniz
Community Manager

Hi @editter,

The tarfile.open() function accepts a local filename or a file object as input, not a dbfs:/ URI. To read a file stored in DBFS from Python, use the local FUSE path with the "/dbfs/<path>" prefix.

Try the following:
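For example, a minimal sketch along these lines (the path is the /dbfs/ FUSE form of the file from your ls output above; mode "r:gz" tells tarfile the archive is gzip-compressed):

import tarfile

# Open the archive through the /dbfs/ FUSE mount rather than the dbfs:/ URI
with tarfile.open('/dbfs/tmp/google_sdk/google_cloud_sdk_352_0_0_linux_x86_64_tar.gz', mode='r:gz') as tar:
    # Extract everything into the notebook's current working directory
    tar.extractall()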

This will extract the contents of the tar.gz file to the current working directory of the Databricks notebook.

Based on the details you provided, the Google Cloud SDK can be installed on the Azure Databricks workspace using the steps below:

  1. Create a cluster with custom requirements to include google-cloud-sdk.
  2. Install google-cloud-sdk using the script action feature of Databricks.

Here is an example script action to install the google-cloud-sdk on the Databricks cluster:

# Define the script to install the Google Cloud SDK
script = """
# Download and install the Google Cloud SDK
curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz --output google-cloud-sdk.tar.gz
tar --extract --gzip --file google-cloud-sdk.tar.gz

# Install the Google Cloud SDK
./google-cloud-sdk/install.sh --usage-reporting=false --command-completion=false --path-update=false

# Remove the tar.gz file and the extracted directory
rm google-cloud-sdk.tar.gz
rm -r google-cloud-sdk
"""

# Define the script action
install_gcloud_sdk = {
  "name": "Install Google Cloud SDK",
  "driver_node_type_id": "Standard_DS3_v2",
  "python_version": "3",
  "script": script
}

# Submit the script action to the cluster
response = dbutils.cluster.submit_run(cluster_id, install_gcloud_sdk)
Replace cluster_id with the ID of your existing cluster. After running this script, the Google Cloud SDK will be installed on the cluster, and you can use it to transfer files from a Google Cloud bucket to Azure Blob Storage using Databricks.
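Once the SDK is available on the cluster, one possible shape for the transfer itself is sketched below. Every name here is an assumption to adapt: the bucket path, the scratch directory, and the /mnt/azure_blob mount point (which presumes the Azure Blob Storage container has already been mounted and that GCP credentials are configured for gsutil).

import os
import subprocess

gcs_path = "gs://my-gcs-bucket/exports"        # assumption: source bucket and prefix
local_dir = "/tmp/gcs_export"                  # scratch space on the driver node
azure_mount = "dbfs:/mnt/azure_blob/exports"   # assumption: Azure Blob container already mounted

# Pull the files down from the Google Cloud bucket with gsutil
os.makedirs(local_dir, exist_ok=True)
subprocess.run(["gsutil", "-m", "cp", "-r", gcs_path, local_dir], check=True)

# Copy them from the driver's local disk into the mounted Azure Blob container
dbutils.fs.cp(f"file:{local_dir}", azure_mount, True)  # True = recursive copy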

editter
New Contributor II

Thank you for the response!

2 Questions:

1. How would you create a cluster with the custom requirements for the Google Cloud SDK? Is that still possible for a Unity Catalog-enabled cluster with Shared Access Mode?

2. Is a script action the same as a cluster init script? I couldn't find any documentation for script actions. 

I tried running that script on an existing cluster and it returned an AttributeError with no description; it just points to the line running dbutils.cluster.submit_run (which I also can't find any documentation for). I verified that the cluster_id and driver_node_type_id were correct.

Thanks for any help!