Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Accessing workspace files within cluster init script

glebex
New Contributor II

Greetings all!

I am currently facing an issue while accessing workspace files from the init script.

As explained in the documentation, an init script can be stored in workspace files (link). This works perfectly fine, and the init script is actually executed.

However, it does not seem possible to reference another workspace file from the init script itself. For example, I placed a pyproject.toml file in my workspace folder (/Workspace/Users/username@email.com/pyproject.toml), and accessing this pyproject.toml from within the init script fails.

I also tried to debug this by listing the root directory ("/") and the /Workspace directory during init script execution. The output of "ls /" shows the /Workspace folder, but "ls /Workspace" throws an error:

ls: cannot open directory '/Workspace': Invalid argument

I'm using Azure Databricks with a cluster I created myself on Databricks Runtime 12.2 LTS ML. The workspace is on the Premium tier, and I'm an admin on it.

As far as I can see, others are facing the same issue.

Regards,

Gleb Smolnik

Accepted Solutions

Anonymous
Not applicable

@Gleb Smolnik:

An init script runs on each cluster node during startup, before any notebook executes, and at that point it does not have direct access to workspace files.

The documentation you mentioned describes storing the init script itself as a file within the Databricks workspace. It does not grant the init script direct access to other workspace files.

To access a workspace file within the init script, you can consider using the Databricks CLI or Databricks API to retrieve the file and then copy or read it on the cluster nodes during the init script execution.
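
The API route suggested above could look roughly like the sketch below. It is a minimal sketch under several assumptions, not a confirmed recipe: it assumes the Workspace export endpoint (`/api/2.0/workspace/export`) returns non-notebook files with `format=AUTO` in your workspace, that the response carries the file as a base64 `content` field, and that `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are environment variables you have set on the cluster yourself. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch only. DATABRICKS_HOST and DATABRICKS_TOKEN are assumed to be set on
# the cluster by you; the functions below are illustrative, not an official API.

# Read an export response (JSON) on stdin and write the decoded file to stdout.
decode_export_response() {
  python3 -c 'import base64, json, sys; sys.stdout.buffer.write(base64.b64decode(json.load(sys.stdin)["content"]))'
}

# Download one workspace file to a local path on the node.
# $1: workspace path as the Workspace API expects it (assumption: the same
#     path you see in the UI, possibly without the /Workspace prefix)
# $2: local destination path
fetch_workspace_file() {
  local workspace_path="$1" dest="$2"
  curl --silent --show-error --fail --get \
    --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    --data-urlencode "path=${workspace_path}" \
    --data-urlencode "format=AUTO" \
    "${DATABRICKS_HOST}/api/2.0/workspace/export" \
    | decode_export_response > "$dest"
}
```

A call such as `fetch_workspace_file "/Users/username@email.com/pyproject.toml" /tmp/pyproject.toml` would then place the file on the node. If the request fails, the decoder receives no valid JSON and the function returns a non-zero status, although an empty destination file may be left behind.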


8 Replies

-werners-
Esteemed Contributor III

Here is a similar topic.

We haven't figured it out yet, but it might be helpful for you.

FRG96
New Contributor III

The link isn't working anymore.

glebex
New Contributor II

Hey @Suteja Kanuri,

Thanks for your answer. I understand your point. However, I can imagine a scenario where the init script acts as an orchestrator, executing other shell scripts in a desired order. The documentation article I referenced (at least as I interpreted it) allows placing the init script in workspace files, which seems to imply that other files will be accessible during init script execution too (which is not the case).

Anyway, I will try to figure it out with the suggestions you provided. It would of course be nice to have workspace files mounted on the Databricks cluster before the init script executes (I'm not sure whether this is on the feature roadmap, so this is just a suggestion).

FRG96
New Contributor III

Hi @Anonymous @glebex 

I want to use the Databricks Workspace export REST API using curl in the init script to download a workspace file locally.
What's the recommended way to pass the Databricks Instance URL and the API Token to the init script execution context?
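
One possible approach, offered as an assumption rather than a confirmed recommendation: define the values as environment variables in the cluster configuration, where a value can reference a secret with the `{{secrets/<scope-name>/<secret-name>}}` syntax, and have the init script verify that they are present before calling the API. The variable names `DATABRICKS_HOST` and `DATABRICKS_TOKEN` and the helper `require_env` are chosen here for illustration only.

```shell
#!/bin/bash
# Sketch only. The variable names are illustrative; they must be defined in
# the cluster's environment variables for the init script to see them.

# Fail fast if any of the named environment variables is missing or empty.
require_env() {
  local name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "init script: required variable ${name} is not set" >&2
      return 1
    fi
  done
}
```

Calling `require_env DATABRICKS_HOST DATABRICKS_TOKEN || exit 1` at the top of the init script makes a missing value show up as a clear message in the init script logs instead of a failed curl request later on.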

Nitya
New Contributor II

When we used the Databricks CLI, it did not copy the .txt file. The other workaround, the Databricks API, works against DBFS; we could not find an API for workspace files. Is there another way to access workspace files within a cluster init script?

Kaniz_Fatma
Community Manager

Hi @Gleb Smolnik, we haven't heard from you since the last responses from @Suteja Kanuri and @Werner Stinckens, and I was checking back to see whether their suggestions helped you.

If you have found a solution of your own, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

jacob_hill_prof
New Contributor II

@Gleb Smolnik You might also want to try cloning a GitHub repo in your init script and storing dependencies such as requirements.txt files and other init scripts there. This lets your cluster pull a whole set of init scripts dynamically from a versioned source.

init.sh

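# Clone the repository that holds the shared init assets (replace the placeholder URL).
git clone "<github repo url>my_repo.git"
# Switch to a non-main branch if the init assets live there.
git -C ./my_repo checkout common_cluster_init
# Install dependencies from a requirements file stored in the repo.
pip install -r ./my_repo/init/dbricks_clusters/requirements.txt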
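
If the cloned repo also holds further shell scripts, the init script can act as the orchestrator mentioned earlier in the thread. The following is a minimal sketch, assuming the scripts sit in one directory and are named so that lexical order is the desired execution order (for example 01_tools.sh, 02_deps.sh); the function name and that naming scheme are hypothetical.

```shell
#!/bin/bash
# Sketch only: run every *.sh file in a directory in lexical order and stop
# at the first script that fails.
# $1: directory containing the scripts
run_scripts_in_order() {
  local dir="$1" script
  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue   # the directory has no matching scripts
    echo "init script: running ${script}" >&2
    bash "$script" || return 1
  done
}
```

For the layout above, a call such as `run_scripts_in_order ./my_repo/init/dbricks_clusters || exit 1` would run the scripts in that folder, assuming it contains any. Stopping at the first failure keeps later scripts from running against a half-configured node.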
