
Running local Python code with arguments in Databricks via the dbx utility

sage5616
Valued Contributor

I am trying to execute a local PySpark script on a Databricks cluster via the dbx utility to test how passing arguments to Python works in Databricks when developing locally. However, the test arguments I am passing are not being read for some reason. Could someone help? I am following this guide, but it is a bit unclear and lacks good examples: https://dbx.readthedocs.io/en/latest/quickstart.html. I also found this, but it is not clear either: "How can I pass and then get the passed arguments in a Databricks job".

The Databricks manuals are not very clear in this area.

My PySpark script:

import sys

# Count the command-line arguments received by the script.
n = len(sys.argv)
print("Total arguments passed:", n)

# sys.argv[0] is the name the script was invoked with.
print("Script name", sys.argv[0])

# Print the remaining arguments, space-separated.
print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
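
For comparison, here is a sketch of the same check using argparse with named flags; the flag names --arg1 and --arg2 are illustrative only, not something the job config below defines:

import argparse

# Sketch: parse named parameters instead of reading positional sys.argv.
# parse_known_args() tolerates any extra arguments the runner may prepend.
parser = argparse.ArgumentParser()
parser.add_argument("--arg1", default=None)
parser.add_argument("--arg2", default=None)
args, unknown = parser.parse_known_args()

print("arg1:", args.arg1)
print("arg2:", args.arg2)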

dbx deployment.json:

{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
            "python_file": "parameter-test.py"
        },
        "parameters": [
          "test-argument-1",
          "test-argument-2"
        ]
      }
    ]
  }
}

dbx execute command:

dbx execute \
  --cluster-id=<redacted> \
  --job=parameter-test \
  --deployment-file=conf/deployment.json \
  --no-rebuild \
  --no-package

Output:

(parameter-test) user@735 parameter-test % /bin/zsh /Users/user/g-drive/git/parameter-test/parameter-test.sh
[dbx][2022-07-26 10:34:33.864] Using profile provided from the project file
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2022-07-26 10:34:33.866] Profile DEFAULT will be used for deployment
[dbx][2022-07-26 10:34:35.897] Executing job: parameter-test in environment default on cluster None (id: 0513-204842-7b2r325u)
[dbx][2022-07-26 10:34:35.897] No rebuild will be done, please ensure that the package distribution is in dist folder
[dbx][2022-07-26 10:34:35.897] Using the provided deployment file conf/deployment.json
[dbx][2022-07-26 10:34:35.899] Preparing interactive cluster to accept jobs
[dbx][2022-07-26 10:34:35.997] Cluster is ready
[dbx][2022-07-26 10:34:35.998] Preparing execution context
[dbx][2022-07-26 10:34:36.534] Existing context is active, using it
[dbx][2022-07-26 10:34:36.992] Requirements file requirements.txt is not provided, following the execution without any additional packages
[dbx][2022-07-26 10:34:36.992] Package was disabled via --no-package, only the code from entrypoint will be used
[dbx][2022-07-26 10:34:37.161] Processing parameters
[dbx][2022-07-26 10:34:37.449] Processing parameters - done
[dbx][2022-07-26 10:34:37.449] Starting entrypoint file execution
[dbx][2022-07-26 10:34:37.767] Command successfully executed
Total arguments passed: 1
Script name python
 
Arguments passed:
[dbx][2022-07-26 10:34:37.768] Command execution finished
(parameter-test) user@735 parameter-test % 

Please help 🙂


2 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can pass parameters using

dbx launch --parameters
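
For example, something like this (a sketch; the exact format --parameters expects varies between dbx versions, so please check the dbx docs for your version):

dbx launch --job=parameter-test --parameters='["test1", "test2"]'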

If you want to define them in the deployment template, please follow the Databricks Jobs API 2.1 schema exactly: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsCreate (for example, parameters belong inside a task, and tasks live in a tasks array; both are missing in your JSON).

{
  "default": {
    "jobs": [
      {
        "name": "A multitask job",
        "tasks": [
          {
            "task_key": "Sessionize",
            "description": "Extracts session data from events",
            "depends_on": [],
            "spark_python_task": {
              "python_file": "com.databricks.Sessionize",
              "parameters": ["--data", "dbfs:/path/to/data.json"]
            }
          }
        ]
      }
    ]
  }
}

Thank you, Hubert. Happy to say that this example helped; I was able to figure it out.

Corrected deployment.json:

{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py",
          "parameters": [
            "test1",
            "test2"
          ]
        }
      }
    ]
  }
}

Output of the Python code posted originally, above:

Total arguments passed: 3
Script name python
 
Arguments passed: test1 test2

For some reason, the name of my Python script is returned as just "python", but the actual name is "parameter-test.py". Any idea why Databricks/DBX does that? Any way to get the actual script name from sys.argv[0]?
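
One workaround sketch (untested; it assumes __file__ is defined when the file runs as a spark_python_task, which may not hold under dbx execute):

import os
import sys

# Fall back to __file__ when sys.argv[0] is just the interpreter name.
# __file__ can be undefined in some execution contexts, hence the guard.
script_name = os.path.basename(__file__) if "__file__" in globals() else sys.argv[0]
print("Script name:", script_name)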

P.S. Again, there are not enough clear, working examples in the manuals (just feedback, take it FWIW).
