Hello,
To access DBFS files or download files from a GCS bucket as part of a Databricks cluster's init script, consider the following approaches:
Use Databricks Connect from your local machine. Install and configure Databricks Connect against your workspace and cluster, then use dbutils.fs through the remote Spark session to list DBFS files, copy files in from GCS, or stage jars in DBFS before the cluster starts.
Use the DBFS mount inside the init script itself. Init scripts run as Bash before Spark starts, so Spark libraries and dbutils are not available there; instead, on most cluster configurations DBFS is exposed on each node at /dbfs, so the script can read staged files and copy jars into /databricks/jars directly, as in the sketch below.
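A minimal sketch of that pattern (the jar name and DBFS path are placeholders; it assumes the /dbfs FUSE mount is available for your cluster's access mode):

#!/bin/bash
set -euo pipefail

# DBFS is exposed on the node through the /dbfs FUSE mount
ls -l /dbfs/FileStore/jars/

# Copy a jar staged in DBFS into the directory the cluster loads jars from
cp /dbfs/FileStore/jars/your-lib.jar /databricks/jars/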
Delegate to a custom script. Write a Python or Bash helper that performs the DBFS and GCS operations, stage it somewhere the cluster can read (for example DBFS or workspace files), and have the init script invoke it.
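For instance, the init script configured on the cluster could contain little more than bash /dbfs/databricks/scripts/setup.sh, with the real work in a helper like this sketch (the paths and bucket name are placeholders; the gsutil call assumes the Google Cloud SDK is present on the node, or installed by the script, and that the cluster's service account can read the bucket):

#!/bin/bash
# /dbfs/databricks/scripts/setup.sh -- helper staged in DBFS ahead of time
set -euo pipefail

echo "Fetching dependencies from GCS..."
# Requires gsutil on the node and read access to the bucket
gsutil cp gs://your-gcs-bucket/your-jar.jar /databricks/jars/
echo "Init helper finished."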
Code Example (using Databricks Connect, run from a machine where databricks-connect is configured for your cluster):
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

# Connect to the remote cluster (uses your databricks-connect configuration)
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# List files in a DBFS directory
files = dbutils.fs.ls("dbfs:/path/to/your/directory")

# Copy a file from GCS into DBFS (the cluster needs read access to the bucket)
dbutils.fs.cp("gs://your-gcs-bucket/file.txt", "dbfs:/path/to/your/file.txt")

# Stage a jar in DBFS so an init script can later copy it into /databricks/jars
dbutils.fs.cp("gs://your-gcs-bucket/your-jar.jar", "dbfs:/FileStore/jars/your-jar.jar")
Key Considerations:
Initialization time: the init script must finish its downloads and copies within the cluster's startup window, or cluster launch will fail or time out.
Dependencies: if the script relies on tools such as gsutil or extra packages, make sure they are already on the node or installed by the script itself.
Error handling: fail fast and log clearly so problems show up in the cluster's init script logs rather than as opaque startup failures (a sketch follows this list).
Databricks Connect configuration: if you use Databricks Connect, make sure it is configured with the correct workspace URL, credentials, and cluster.
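A hedged sketch of what that error handling can look like in a Bash init script (the log path, bucket, and jar name are just examples):

#!/bin/bash
set -euo pipefail

# Mirror everything the script prints into a log file so failures are easy
# to diagnose from the cluster's init script logs
exec > >(tee -a /tmp/init-script.log) 2>&1

# Retry a flaky GCS download a few times before giving up
for attempt in 1 2 3; do
  if gsutil cp gs://your-gcs-bucket/your-jar.jar /databricks/jars/; then
    break
  fi
  echo "Download failed (attempt ${attempt}), retrying..."
  sleep 5
done

# Fail the cluster start explicitly if the jar never arrived
[[ -f /databricks/jars/your-jar.jar ]] || { echo "Could not fetch jar from GCS" >&2; exit 1; }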
By following one of these approaches and addressing the considerations above, you should be able to access DBFS files, pull dependencies from GCS, and place jars as part of your Databricks cluster's init script.