07-26-2022 10:50 AM
I am trying to execute a local PySpark script on a Databricks cluster via the dbx utility, to test how passing arguments to Python works in Databricks when developing locally. However, the test arguments I am passing are not being read for some reason. Could someone help? I am following this guide, but it is a bit unclear and lacks good examples: https://dbx.readthedocs.io/en/latest/quickstart.html I also found this, but it is not clear either: How can I pass and then get the passed arguments in a Databricks job
The Databricks manuals are not very clear in this area.
My PySpark script:
import sys

# Print everything the script receives on the command line.
n = len(sys.argv)
print("Total arguments passed:", n)
print("Script name", sys.argv[0])
print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
dbx deployment.json:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py"
        },
        "parameters": [
          "test-argument-1",
          "test-argument-2"
        ]
      }
    ]
  }
}
dbx execute command:
dbx execute \
  --cluster-id=<redacted> \
  --job=parameter-test \
  --deployment-file=conf/deployment.json \
  --no-rebuild \
  --no-package
Output:
(parameter-test) user@735 parameter-test % /bin/zsh /Users/user/g-drive/git/parameter-test/parameter-test.sh
[dbx][2022-07-26 10:34:33.864] Using profile provided from the project file
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2022-07-26 10:34:33.866] Profile DEFAULT will be used for deployment
[dbx][2022-07-26 10:34:35.897] Executing job: parameter-test in environment default on cluster None (id: 0513-204842-7b2r325u)
[dbx][2022-07-26 10:34:35.897] No rebuild will be done, please ensure that the package distribution is in dist folder
[dbx][2022-07-26 10:34:35.897] Using the provided deployment file conf/deployment.json
[dbx][2022-07-26 10:34:35.899] Preparing interactive cluster to accept jobs
[dbx][2022-07-26 10:34:35.997] Cluster is ready
[dbx][2022-07-26 10:34:35.998] Preparing execution context
[dbx][2022-07-26 10:34:36.534] Existing context is active, using it
[dbx][2022-07-26 10:34:36.992] Requirements file requirements.txt is not provided, following the execution without any additional packages
[dbx][2022-07-26 10:34:36.992] Package was disabled via --no-package, only the code from entrypoint will be used
[dbx][2022-07-26 10:34:37.161] Processing parameters
[dbx][2022-07-26 10:34:37.449] Processing parameters - done
[dbx][2022-07-26 10:34:37.449] Starting entrypoint file execution
[dbx][2022-07-26 10:34:37.767] Command successfully executed
Total arguments passed: 1
Script name python
Arguments passed:
[dbx][2022-07-26 10:34:37.768] Command execution finished
(parameter-test) user@735 parameter-test %
Please help 🙂
Accepted Solutions
07-27-2022 08:45 AM
Thank you, Hubert. Happy to say that your example helped; I was able to figure it out.
Corrected deployment.json:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py",
          "parameters": [
            "test1",
            "test2"
          ]
        }
      }
    ]
  }
}
Output of the Python code posted originally, above:
Total arguments passed: 3
Script name python
Arguments passed: test1 test2
For some reason, the name of my Python script is returned as just "python", while the actual name is "parameter-test.py". Any idea why Databricks/dbx does that? Is there a way to get the actual script name from sys.argv[0]?
P.S. Again, there are not enough clear, working examples in the manuals (just feedback, take it FWIW).
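As a side note for anyone hitting the same thing: since the values arrive through sys.argv regardless, one way to make the script independent of whatever ends up in sys.argv[0] is to parse named or positional arguments explicitly with the standard-library argparse module. A minimal, hedged sketch (the --data flag and the argument names are purely illustrative, not anything dbx requires):

import argparse

# argparse reads sys.argv[1:], so the value of sys.argv[0] no longer matters.
parser = argparse.ArgumentParser(description="parameter-test")
parser.add_argument("--data", help="optional named argument, e.g. dbfs:/path/to/data.json")
parser.add_argument("positional", nargs="*", help="any remaining positional arguments")
args = parser.parse_args()
print("data:", args.data)
print("positional:", args.positional)

With "parameters": ["test1", "test2"], args.data stays None and args.positional comes back as ["test1", "test2"].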
07-27-2022 03:20 AM
You can pass parameters using
dbx launch --parameters
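For example, something along these lines (a hedged sketch, not verified against a specific dbx version; the exact flag names and payload format vary between dbx releases, so check dbx launch --help):
dbx launch --job=parameter-test --parameters='["test1", "test2"]'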
If you want to define the parameters in the deployment template, please follow the Databricks Jobs API 2.1 schema exactly: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsCreate (for example, parameters belong inside a task, and tasks are defined in a tasks array; both are missing from your JSON):
{
  "default": {
    "jobs": [
      {
        "name": "A multitask job",
        "tasks": [
          {
            "task_key": "Sessionize",
            "description": "Extracts session data from events",
            "depends_on": [],
            "spark_python_task": {
              "python_file": "com.databricks.Sessionize",
              "parameters": ["--data", "dbfs:/path/to/data.json"]
            }
          }
        ]
        ....