Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Not able to access DBFS in init script on GCP Databricks

talenik
New Contributor III

Hi Everyone, 

I am trying to access DBFS files from an init script while the cluster is starting on GCP Databricks, but I am not able to list the files that are on DBFS. I also tried to download files from a GCS bucket, but the init script throws timeout errors.

I want to place one jar in /databricks/jars while the cluster is starting.

Thanks in advance!!

1 REPLY

jason34
New Contributor II

Hello,

To access DBFS files or download from a GCS bucket as part of a Databricks cluster's startup, consider the following approaches:
- Install Databricks Connect (or the Databricks SDK) on your local machine, connect to your workspace, and use it to list DBFS files, pull files from GCS, or stage jars before the cluster needs them.
- Use the dbutils.fs API from a notebook or job (dbutils is not available inside the shell init script itself) to interact with DBFS: list files, copy from GCS, or stage jars using Spark's capabilities.
- Create a custom script (e.g., Python or Bash) to handle the file operations, store it alongside your cluster's init scripts, and execute it from your init script to perform the desired actions (a sketch of this approach follows the code example below).
Code example (a minimal sketch using the Databricks SDK for Python and the google-cloud-storage client, run from your local machine; paths and bucket names are placeholders):

# Requires: pip install databricks-sdk google-cloud-storage
# Credentials are read from the environment (DATABRICKS_HOST / DATABRICKS_TOKEN,
# GOOGLE_APPLICATION_CREDENTIALS) or ~/.databrickscfg; all paths are placeholders.
from databricks.sdk import WorkspaceClient
from google.cloud import storage

w = WorkspaceClient()

# List files in a DBFS directory
for entry in w.dbfs.list("/path/to/your/directory"):
    print(entry.path)

# Download a file from GCS (plain google-cloud-storage client, not a Databricks API)
gcs = storage.Client()
gcs.bucket("your-gcs-bucket").blob("file.txt").download_to_filename("/path/to/your/local/file.txt")

# Upload the jar to DBFS so an init script can later copy it into /databricks/jars
with open("/path/to/your-jar.jar", "rb") as jar:
    w.dbfs.upload("/FileStore/jars/your-jar.jar", jar, overwrite=True)
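For the custom-script approach, here is a minimal sketch of the cluster init script itself, assuming the jar was staged to dbfs:/FileStore/jars/your-jar.jar as in the example above and that the /dbfs FUSE mount is reachable when the script runs:

#!/bin/bash
# Sketch of a cluster-scoped init script: copy a jar that was previously
# uploaded to DBFS into /databricks/jars, the directory the poster wants to
# populate. /dbfs is the DBFS FUSE mount on the node; the source path is a
# placeholder matching the upload example above.
set -euo pipefail

mkdir -p /databricks/jars
cp /dbfs/FileStore/jars/your-jar.jar /databricks/jars/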

Key Considerations:
- Initialization time: ensure your init script has enough time to complete its work before the cluster finishes starting.
- Dependencies: if you use the dbutils.fs API or a custom script, include the necessary dependencies in your cluster's configuration.
- Error handling: implement proper error handling so exceptions or timeouts are handled gracefully (see the retry sketch below).
- Databricks Connect configuration: if you use Databricks Connect, make sure it is configured with your cluster's credentials and URL.
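On the error-handling point, a minimal retry sketch in Bash; the DBFS path, retry count, and sleep interval are illustrative assumptions, not values from this thread. If every attempt fails, the non-zero exit code fails the init script and therefore the cluster start:

#!/bin/bash
# Retry a copy from the DBFS FUSE mount a few times before giving up, so a
# transient mount delay or timeout does not immediately abort cluster startup.
set -euo pipefail

SRC=/dbfs/FileStore/jars/your-jar.jar   # placeholder DBFS path
DST=/databricks/jars

mkdir -p "$DST"
for attempt in 1 2 3 4 5; do
  if cp "$SRC" "$DST/"; then
    echo "Copied $SRC on attempt $attempt"
    exit 0
  fi
  echo "Copy failed (attempt $attempt), retrying in 10s" >&2
  sleep 10
done

echo "Giving up after 5 attempts" >&2
exit 1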
By following these approaches and addressing the potential challenges, you should be able to successfully access DBFS files and perform other operations within your Databricks cluster's init script.
