
Driver context not found for Python Spark spark_submit_task using the Jobs API runs submit endpoint

umarkhan
New Contributor II

I am trying to run a multi-file Python job in Databricks without using notebooks. I have tried setting this up by:

  • creating a Docker image using the Databricks Runtime 10.4 LTS image as a base and adding the zipped Python application to it.
  • making a call to the runs submit endpoint with this payload:
{
    "tasks": {
        "task_key": "test-run-8",
        "spark_submit_task": {
            "parameters": [
                "--py-files",
                "/app.zip",
                "/app.zip/__main__.py"
            ]
        },
        "new_cluster": {
            "num_workers": 1,
            "spark_version": "11.1.x-scala2.12",
            "aws_attributes": {
                "first_on_demand": 1,
                "availability": "SPOT_WITH_FALLBACK",
                "zone_id": "us-west-2a",
                "instance_profile_arn": "<instance profile ARN>",
                "spot_bid_price_percent": 100,
                "ebs_volume_count": 0
            },
            "node_type_id": "i3.xlarge",
            "docker_image": {
                "url": "<aws-account-number>.dkr.ecr.us-west-2.amazonaws.com/spark-app:0.1.17"
            }
        }
    }
}
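For reference, the call itself is made roughly like this (a simplified sketch; the workspace URL and token are placeholders, and note that the 2.1 runs submit endpoint expects "tasks" to be an array of task objects):

import requests

# Minimal sketch of the runs submit call. The workspace URL and token are
# placeholders; the cluster spec is elided here for brevity.
DATABRICKS_HOST = "https://<workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "run_name": "test-run-8",
    "tasks": [
        {
            "task_key": "test-run-8",
            "spark_submit_task": {
                "parameters": [
                    "--py-files",
                    "/app.zip",
                    "/app.zip/__main__.py",
                ]
            },
            "new_cluster": {
                # same new_cluster spec as in the payload above
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # response should contain the run_id of the submitted run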

The application reads a JSON file and loads it into a new Delta Lake table. Unfortunately this does not work as intended. Here is what I have found:

  • When I run the application code from a notebook, it works normally.
  • When running via the Jobs endpoint, I don't see the table at all in the Databricks UI.
  • When checking the S3 bucket, I do see a folder created for the database and some Parquet files for the table.
  • Running any query against this table fails with a not-found error.
  • When checking the driver logs, I see the following:
...
22/08/19 02:38:28 WARN DefaultTableOwnerAclClient: failed to update the table owner when create/drop table.
java.lang.IllegalStateException: Driver context not found
...
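For context, the core of the application is essentially the following (a simplified sketch; the S3 path, database, and table names are placeholders):

from pyspark.sql import SparkSession

# Simplified sketch of the application logic; the path and names are placeholders.
spark = SparkSession.builder.appName("json-to-delta").getOrCreate()

df = spark.read.json("s3://<bucket>/<prefix>/input.json")

# Create the target database if needed, then write a managed Delta table.
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
df.write.format("delta").mode("overwrite").saveAsTable("my_db.my_table")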

Some extra considerations:

  • I'd like to be able to use Docker images for our deployment if possible, since it matches our current CI/CD pattern.
  • Failing that, I'd be OK with a spark_python_task, but I have not been able to get this to work when I have multiple Python files (see the sketch after this list).
  • I want to avoid using notebooks for deploying applications.
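For the spark_python_task route, this is roughly the shape of the task I have been trying (a sketch only; the DBFS paths and wheel name are placeholders, and packaging the supporting modules as a wheel attached via libraries is just one approach I have seen suggested for multi-file applications):

# Sketch of a spark_python_task; DBFS paths and the wheel name are placeholders.
task = {
    "task_key": "test-run-python",
    "spark_python_task": {
        "python_file": "dbfs:/apps/app/main.py",
        "parameters": ["--input", "s3://<bucket>/<prefix>/input.json"],
    },
    # Supporting modules packaged as a wheel and attached as a cluster library.
    "libraries": [{"whl": "dbfs:/apps/app/app-0.1.0-py3-none-any.whl"}],
    "new_cluster": {
        "num_workers": 1,
        "spark_version": "11.1.x-scala2.12",
        "node_type_id": "i3.xlarge",
    },
}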

Any help with understanding and fixing this error would be much appreciated.

Regards,

Umar


umarkhan
New Contributor II

Hello @Kaniz Fatma (Databricks), thanks for the response. No, table access control has not been enabled. As I understand it, this should allow anyone to access the table by default.

Also, in case it helps, we are using AWS.

Vidula
Honored Contributor

Hi @Umar Khan

Hope all is well! Just wanted to check in to see if you were able to resolve your issue, and if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
