07-26-2022 10:50 AM
I am trying to execute a local PySpark script on a Databricks cluster via the dbx utility, to test how passing arguments to Python works in Databricks when developing locally. However, the test arguments I am passing are not being read for some reason. Could someone help? I am following this guide, but it is a bit unclear and lacks good examples: https://dbx.readthedocs.io/en/latest/quickstart.html I also found this, but it is not clear either: How can I pass and then get the passed arguments in a Databricks job
The Databricks manuals are not very clear in this area.
My PySpark script:
import sys

# Print everything the script receives on the command line.
n = len(sys.argv)
print("Total arguments passed:", n)
print("Script name", sys.argv[0])
print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
dbx deployment.json:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py"
        },
        "parameters": [
          "test-argument-1",
          "test-argument-2"
        ]
      }
    ]
  }
}
dbx execute command:
dbx execute \
  --cluster-id=<redacted> \
  --job=parameter-test \
  --deployment-file=conf/deployment.json \
  --no-rebuild \
  --no-package
Output:
(parameter-test) user@735 parameter-test % /bin/zsh /Users/user/g-drive/git/parameter-test/parameter-test.sh
[dbx][2022-07-26 10:34:33.864] Using profile provided from the project file
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2022-07-26 10:34:33.866] Profile DEFAULT will be used for deployment
[dbx][2022-07-26 10:34:35.897] Executing job: parameter-test in environment default on cluster None (id: 0513-204842-7b2r325u)
[dbx][2022-07-26 10:34:35.897] No rebuild will be done, please ensure that the package distribution is in dist folder
[dbx][2022-07-26 10:34:35.897] Using the provided deployment file conf/deployment.json
[dbx][2022-07-26 10:34:35.899] Preparing interactive cluster to accept jobs
[dbx][2022-07-26 10:34:35.997] Cluster is ready
[dbx][2022-07-26 10:34:35.998] Preparing execution context
[dbx][2022-07-26 10:34:36.534] Existing context is active, using it
[dbx][2022-07-26 10:34:36.992] Requirements file requirements.txt is not provided, following the execution without any additional packages
[dbx][2022-07-26 10:34:36.992] Package was disabled via --no-package, only the code from entrypoint will be used
[dbx][2022-07-26 10:34:37.161] Processing parameters
[dbx][2022-07-26 10:34:37.449] Processing parameters - done
[dbx][2022-07-26 10:34:37.449] Starting entrypoint file execution
[dbx][2022-07-26 10:34:37.767] Command successfully executed
Total arguments passed: 1
Script name python
Arguments passed:
[dbx][2022-07-26 10:34:37.768] Command execution finished
(parameter-test) user@735 parameter-test %
Please help 🙂
Accepted Solutions
07-27-2022 08:45 AM
Thank you, Hubert. Happy to say that your example helped; I was able to figure it out.
Corrected deployment.json:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py",
          "parameters": [
            "test1",
            "test2"
          ]
        }
      }
    ]
  }
}
Output of the Python code posted originally, above:
Total arguments passed: 3
Script name python
Arguments passed: test1 test2
For some reason, the name of my Python script is returned as just "python", while the actual name is "parameter-test.py". Any idea why Databricks/dbx does that? Is there a way to get the actual script name from sys.argv[0]?
P.S. Again, there are not enough clear, working examples in the manuals (just feedback, take it FWIW).
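As a side note for anyone hitting the same thing: since the values arrive through sys.argv regardless, one way to make the script independent of whatever ends up in sys.argv[0] is to parse named or positional arguments explicitly with the standard-library argparse module. A minimal, hedged sketch (the --data flag and the argument names are purely illustrative, not anything dbx requires):

import argparse

# argparse reads sys.argv[1:], so the value of sys.argv[0] no longer matters.
parser = argparse.ArgumentParser(description="parameter-test")
parser.add_argument("--data", help="optional named argument, e.g. dbfs:/path/to/data.json")
parser.add_argument("positional", nargs="*", help="any remaining positional arguments")
args = parser.parse_args()
print("data:", args.data)
print("positional:", args.positional)

With "parameters": ["test1", "test2"], args.data stays None and args.positional comes back as ["test1", "test2"].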
07-27-2022 03:20 AM
You can pass parameters using
dbx launch --parameters
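For example, something along these lines (a hedged sketch, not verified against a specific dbx version; the exact flag names and payload format vary between dbx releases, so check dbx launch --help):
dbx launch --job=parameter-test --parameters='["test1", "test2"]'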
If you want to define the parameters in the deployment template, please follow the Databricks Jobs API 2.1 schema exactly: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsCreate (for example, parameters belong inside a task, and tasks are defined in a tasks array; both are missing from your JSON):
{
  "default": {
    "jobs": [
      {
        "name": "A multitask job",
        "tasks": [
          {
            "task_key": "Sessionize",
            "description": "Extracts session data from events",
            "depends_on": [],
            "spark_python_task": {
              "python_file": "com.databricks.Sessionize",
              "parameters": ["--data", "dbfs:/path/to/data.json"]
            }
          }
        ]
        ....