I am trying to run a multi-file Python job in Databricks without using notebooks. I have tried setting this up by:
- creating a Docker image using the Databricks Runtime 10.4 LTS container image as a base and adding the zipped Python application to it;
- making a call to the runs submit endpoint with this payload:
{
  "tasks": [
    {
      "task_key": "test-run-8",
      "spark_submit_task": {
        "parameters": [
          "--py-files",
          "/app.zip",
          "/app.zip/__main__.py"
        ]
      },
      "new_cluster": {
        "num_workers": 1,
        "spark_version": "11.1.x-scala2.12",
        "aws_attributes": {
          "first_on_demand": 1,
          "availability": "SPOT_WITH_FALLBACK",
          "zone_id": "us-west-2a",
          "instance_profile_arn": "<instance profile ARN>",
          "spot_bid_price_percent": 100,
          "ebs_volume_count": 0
        },
        "node_type_id": "i3.xlarge",
        "docker_image": {
          "url": "<aws-account-number>.dkr.ecr.us-west-2.amazonaws.com/spark-app:0.1.17"
        }
      }
    }
  ]
}
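In case it is relevant, this is roughly how I call the endpoint (a sketch: the host and token are placeholders, and the new_cluster block is abbreviated here; the full payload is the one shown above):

# Sketch of the submission call; host/token are placeholders and the
# new_cluster block is trimmed compared to the full payload above.
import requests

DATABRICKS_HOST = "https://<workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "tasks": [
        {
            "task_key": "test-run-8",
            "spark_submit_task": {
                "parameters": ["--py-files", "/app.zip", "/app.zip/__main__.py"]
            },
            "new_cluster": {
                "num_workers": 1,
                "spark_version": "11.1.x-scala2.12",
                "node_type_id": "i3.xlarge",
                # aws_attributes and docker_image as in the JSON above
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # returns the run_id of the submitted run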
The application tries to read a JSON file and load it into a new Delta Lake table. Unfortunately, this does not work as intended. Here is what I have found:
- When I run the application code from a notebook, it works as expected.
- When I run it via the jobs runs submit endpoint, I don't see the table at all in the Databricks UI.
- When checking the S3 bucket, I do see a folder created for the database and some Parquet files for the table.
- Running any query against this table fails with a not-found error.
- When checking the driver logs, I see the following:
...
22/08/19 02:38:28 WARN DefaultTableOwnerAclClient: failed to update the table owner when create/drop table.
java.lang.IllegalStateException: Driver context not found
...
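For reference, the write path in __main__.py is roughly the following (simplified; the input path and the database/table names are placeholders):

# Simplified version of what __main__.py does: read a JSON file and
# save it as a new Delta table. Paths and names are placeholders.
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.getOrCreate()

    # Read the source JSON file (placeholder path)
    df = spark.read.json("s3://<bucket>/input/events.json")

    # Create the target database and write the table
    spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
    (
        df.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("my_db.events")  # this saveAsTable seems to trigger the warning above
    )


if __name__ == "__main__":
    main()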
Some extra considerations:
- I'd like to keep using Docker images for deployment if possible, since that matches our current CI/CD pattern.
- Failing that, I'd be OK with a spark_python_task, but I have not been able to get that to work when the application has multiple Python files (a rough sketch of what I tried is after this list).
- I want to avoid using notebooks for deploying applications.
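To illustrate the second point, the spark_python_task variant I experimented with looks roughly like this (the DBFS paths, the parameters, and the idea of packaging the code as a wheel are illustrative guesses, not something I have working):

# Illustrative spark_python_task version of the same submission.
# DBFS paths are placeholders; the open question is how to make the
# other modules of the packaged application importable from
# __main__.py when it is launched this way.
task = {
    "task_key": "test-run-python",
    "spark_python_task": {
        "python_file": "dbfs:/apps/app/__main__.py",
        "parameters": ["--input", "s3://<bucket>/input/events.json"],
    },
    # one option might be attaching the package as a wheel instead of a zip
    "libraries": [{"whl": "dbfs:/apps/app/app-0.1.17-py3-none-any.whl"}],
    "new_cluster": {
        "num_workers": 1,
        "spark_version": "11.1.x-scala2.12",
        "node_type_id": "i3.xlarge",
    },
}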
Any help with understanding and fixing this error would be much appreciated.
Regards,
Umar