06-13-2023 05:56 AM
Greetings all!
I am currently facing an issue while accessing workspace files from the init script.
As explained in the documentation, it is possible to store an init script in workspace files (link). This works fine, and the init script is actually executed.
However, it seems that it is not possible to reference a workspace file from the init script itself. For example, if I place a pyproject.toml file in my workspace folder (/Workspace/Users/username@email.com/pyproject.toml), accessing it from within the init script fails.
To debug this, I listed the root directory ("/") and the /Workspace directory during init script execution. "ls /" shows the /Workspace folder as visible; however, "ls /Workspace" throws an error:
ls: cannot open directory '/Workspace': Invalid argument
I'm using Azure Databricks with a cluster I created myself on Databricks Runtime 12.2 LTS ML. The workspace is on the premium tier, and I am an admin on it.
As far as I can see, others are facing the same issue as well.
Regards,
Gleb Smolnik
06-14-2023 12:45 AM
@Gleb Smolnik :
The init script runs on the cluster nodes before the notebook execution, and it does not have direct access to workspace files.
The documentation you mentioned refers to placing the init script inside a workspace file, which means you can store the script itself in a file within the Databricks workspace. However, it doesn't grant direct access to other workspace files from within the init script.
To access a workspace file within the init script, you can consider using the Databricks CLI or Databricks API to retrieve the file and then copy or read it on the cluster nodes during the init script execution.
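As a rough illustration of the API route, here is a minimal sketch that downloads one workspace file with curl through the Workspace export endpoint (GET /api/2.0/workspace/export). It rests on several assumptions that are not confirmed in this thread: DATABRICKS_HOST and DATABRICKS_TOKEN are hypothetical variable names that would have to be supplied to the init script (for example as cluster environment variables), the API is assumed to expect the path without the /Workspace prefix, and format=AUTO with direct_download=true is assumed to return the raw file content. The helper names to_api_path and fetch_workspace_file are made up for this example.

```shell
#!/bin/bash
# Sketch only: DATABRICKS_HOST and DATABRICKS_TOKEN are assumed to be set
# for the init script (e.g. as cluster environment variables); they are
# not defined anywhere in this thread.

# Convert a /Workspace/... filesystem path into the form the Workspace API
# is assumed to expect (the same path without the /Workspace prefix).
to_api_path() {
  printf '%s\n' "${1#/Workspace}"
}

# Download one workspace file to a local destination via
# GET /api/2.0/workspace/export. Returns curl's exit status.
fetch_workspace_file() {
  local ws_path dest
  ws_path="$(to_api_path "$1")"
  dest="$2"
  curl --silent --show-error --fail --get \
    --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    --data-urlencode "path=${ws_path}" \
    --data-urlencode "format=AUTO" \
    --data-urlencode "direct_download=true" \
    --output "${dest}" \
    "${DATABRICKS_HOST}/api/2.0/workspace/export"
}
```

Inside the init script this could be called as fetch_workspace_file /Workspace/Users/username@email.com/pyproject.toml /tmp/pyproject.toml, where /tmp/pyproject.toml is an arbitrary local target. Whether the export works for non-notebook files on a given workspace should be verified before relying on it.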
06-13-2023 06:08 AM
We haven't figured it out yet but it might be helpful for you.
07-06-2023 10:56 PM
Link isn't working anymore
06-14-2023 01:31 AM
Hey @Suteja Kanuri,
Thanks for your answer. I understand your point. However, I can imagine a scenario in which the init script acts as an orchestrator, executing other shell scripts in a desired order. The documentation article I referenced (at least as I interpreted it) allows placing the init script in workspace files, which seems to imply that other workspace files will be accessible during init script execution too (which is not the case).
Anyway, I will try to figure it out with the suggestions you provided. It would obviously be nice to have workspace files mounted on the Databricks cluster before init script execution (I am not sure whether this is on the feature roadmap, so this is just a suggestion).
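To make the orchestrator idea concrete, here is a minimal sketch of what I have in mind. It assumes the helper scripts have already been copied to a local directory on the node (since /Workspace is not readable at that point); the function name run_in_order and the numeric file-name prefixes are hypothetical conventions, not anything Databricks provides.

```shell
#!/bin/bash
# Sketch only: run every *.sh file in a local directory in lexical order
# (e.g. 01-system.sh, 02-python.sh) and stop at the first failure.
run_in_order() {
  local dir="$1" script
  for script in "$dir"/*.sh; do
    # If the glob matched nothing, the literal pattern is left over; skip it.
    [ -e "$script" ] || continue
    echo "running ${script}"
    bash "$script" || return 1
  done
}
```

The init script would then end with something like run_in_order /tmp/init.d, where /tmp/init.d is an arbitrary local directory chosen for this example.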
07-07-2023 12:12 AM - edited 07-07-2023 12:28 AM
Hi @Anonymous @glebex
I want to use the Databricks Workspace export REST API using curl in the init script to download a workspace file locally.
What's the recommended way to pass the Databricks Instance URL and the API Token to the init script execution context?
07-09-2023 11:07 PM
When we used the Databricks CLI, it did not copy the .txt file. The other workaround, using the Databricks API, works with DBFS, and there seems to be no API for workspace files. Is there another way to access workspace files from within a cluster init script?
06-20-2023 01:44 PM
@Gleb Smolnik You might also want to try cloning a GitHub repo in your init script and storing dependencies such as requirements.txt files and other init scripts there. That way you can pull a whole set of init scripts into your cluster dynamically from a versioned source.
init.sh
git clone <github repo url>my_repo.git
git -C ./my_repo checkout common_cluster_init # checkout non-main branches
pip install -r ./my_repo/init/dbricks_clusters/requirements.txt # use scripts in the repo
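A slightly more defensive variant of the same idea, as a sketch: it shallow-clones a single branch and fails loudly if the requirements file is missing. The function name clone_and_install and all four arguments are placeholders for this example; authentication for private repos is not covered here.

```shell
#!/bin/bash
# Sketch only: repo URL, branch, target directory and requirements path
# are placeholders passed in by the caller.
clone_and_install() {
  local repo_url="$1" branch="$2" dest="$3" req="$4"
  # Shallow-clone only the requested branch to keep cluster start-up short.
  git clone --depth 1 --branch "$branch" "$repo_url" "$dest" || return 1
  # Install dependencies only if the requirements file is actually present.
  if [ -f "$dest/$req" ]; then
    pip install -r "$dest/$req"
  else
    echo "missing $dest/$req" >&2
    return 1
  fi
}
```

With the names from the snippet above, the call would look like clone_and_install "<github repo url>" common_cluster_init ./my_repo init/dbricks_clusters/requirements.txt.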