Sunday
We are using DABs to deploy our jobs. The DABs have source set to a Git branch or Git tag depending on the environment. The repository is structured as a mono repo. We don't use wheels for our modules. Sometimes when the jobs run, they "randomly" fail because some module is not found, e.g. "ModuleNotFoundError: No module named 'lib'". A restart then runs without any issues.
I'm trying to understand what's happening, but it looks like PYTHONPATH is sometimes not set correctly.
Did anyone see this behavior?
Sunday
@pepco Would you mind sharing your DAB yaml (hiding secrets)?
yesterday
job.yml
resources:
  jobs:
    pdm_general_ledger_details_hub_job:
      name: org_team_pdm_general_ledger_details_hub_job
      description: "Load General Ledger Details reference hub table"
      email_notifications:
        on_failure:
          - dl@redacted.com
      performance_target: PERFORMANCE_OPTIMIZED
      tasks:
        - task_key: load_ref_pah_general_ledger_details_hub
          timeout_seconds: "${var.timeout}"
          notebook_task:
            notebook_path: notebooks/run_query
            base_parameters:
              pipeline_name: ref_pah_general_ledger_details_hub
              target_table: schema.table
              query_file: ../pipelines/pdm_general_ledger_details_hub/src/load_ref_pah_general_ledger_details_hub.sql
            source: GIT
      git_source:
        git_url: https://github.com/redacted/redacted.git
        git_provider: gitHub
      tags:
        app-ci-id: ${var.configuration_item}
        job-type: child
databricks.yml
bundle:
  name: pdm_ref_general_ledger_details

# These are any additional configuration files to include.
include:
  - bundle/jobs/*.yml
  - bundle/variables/*.yml

non_production_job_permissions: &non_prod_job_permissions
  permissions:
    - level: CAN_MANAGE
      group_name: redacted
    - level: CAN_MANAGE
      service_principal_name: ${var.service_account}

production_job_permissions: &prod_job_permissions
  permissions:
    - level: CAN_MANAGE_RUN
      group_name: redacted
    - level: CAN_MANAGE
      service_principal_name: ${var.service_account}
    - level: CAN_MANAGE
      service_principal_name: ${var.snow_service_account}

non_production_job_notifications: &non_prod_job_notifications
  email_notifications:
    on_failure:
      - dl@redacted

production_job_notifications: &prod_job_notifications
  email_notifications:
    on_failure:
      - dl@redacted
  webhook_notifications:
    on_failure:
      - id: ${var.snow_webhook_id}

targets:
  test:
    mode: production
    default: false
    presets:
      trigger_pause_status: UNPAUSED
      jobs_max_concurrent_runs: 1
    workspace:
      host: https://redacted.cloud.databricks.com
      root_path: /Workspace/org/team/.bundle/${bundle.target}/${var.developer_id}/${bundle.name}
    resources:
      jobs:
        pdm_general_ledger_details_hub_job:
          git_source:
            git_branch: ${var.git_branch}
          <<:
            - *non_prod_job_permissions
            - *non_prod_job_notifications

variables:
  uc_catalog:
    description: Unity Catalog prefix.
    default: "tst"
  configuration_item:
    default: redacted
7 hours ago
Hi again @pepco, as I explained in my answer below, the issue is probably caused by using source: GIT / git_source together with DABs. DBKS does not recommend this pattern for bundles, because the job then runs from Git at runtime instead of from the workspace files deployed by the bundle.
In a mono repo, this can make imports of shared top-level modules like lib unreliable. You should remove git_source and source: GIT, deploy the code with the bundle, use workspace source paths, and include shared folders through sync.paths. Also, don't forget to make the repo root explicit in sys.path, or package the shared code as a wheel.
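As a sketch, the task definition above with workspace source would look roughly like this (a hypothetical fragment built from the original job.yml; paths are illustrative, not a drop-in config):

```yaml
# Hypothetical workspace-source variant of the task: no git_source and no
# source: GIT. The bundle deploys the files, and the task runs from the
# deployed workspace copy.
resources:
  jobs:
    pdm_general_ledger_details_hub_job:
      tasks:
        - task_key: load_ref_pah_general_ledger_details_hub
          notebook_task:
            notebook_path: notebooks/run_query  # resolved relative to the bundle at deploy time
            source: WORKSPACE
```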
5 hours ago
I'm sorry, I'm not ready to accept this as a solution. I'm not saying you are wrong, though. The documentation is not clear on this, or I would say that it contains contradictions.
Add tasks to jobs in Declarative Automation Bundles | Databricks on AWS:
"..., because local relative paths may not point to the same content in the Git repository."
I'm not using relative imports to import my shared modules.
"Instead, clone the repository locally and set up your bundle project within this repository, so that the source for tasks are the workspace."
It was my understanding that when the job starts, it clones the repository onto the cluster, and therefore it should behave correctly: Use Git with Lakeflow Jobs | Databricks on AWS
yesterday
Hello @pepco !
I will share my personal experience with a very similar behaviour to yours.
If you check the DBKS docs, you will find that git_source and task source: GIT are not recommended for DABs, because local relative paths may not point to the same content in the Git repo, and bundles expect the deployed job to run from the same files that were deployed from the local bundle copy.
You need to use workspace source for bundle tasks instead. https://docs.databricks.com/aws/en/dev-tools/bundles/job-task-types
In my case, this is what I had:
- mono repo
- DAB deployment source = Git branch/tag
- custom local modules, no wheels
- imports like: import lib
I understood at that time that this combination can work most of the time, but it depends heavily on what DBKS puts into the cwd or sys.path for that specific task run.
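If you want to see what a given run actually gets, a small diagnostic first cell (hypothetical, but it only uses the standard library) captures exactly the two things that decide whether import lib resolves:

```python
# Hypothetical diagnostic cell: run it as the first cell of the failing
# task to record the working directory and import path for that run.
import os
import sys

def import_state() -> dict:
    """Return the cwd and sys.path as seen by this task run."""
    return {"cwd": os.getcwd(), "sys_path": list(sys.path)}

state = import_state()
print("cwd:", state["cwd"])
for entry in state["sys_path"]:
    print("sys.path:", entry)
```

Comparing the output of a failing run against a succeeding retry shows whether the repo root is the entry that comes and goes.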
So I took time to understand what happens behind the scenes. In reality, when a task uses Git source, DBKS retrieves the notebook or Python file from the Git repo at runtime. For Python script tasks, DBKS says that workspace paths must be absolute while Git paths are relative, and if source is empty, the task uses GIT whenever git_source is defined.
My task file was found, but my shared lib folder was not consistently on sys.path, so Python was failing with the famous ModuleNotFoundError: No module named 'lib'.
A retry can succeed because the run may start in a slightly different initialized state, or because the cluster already carries path state from a previous successful run. That does not mean the setup is deterministic.
What I have done since is avoid runtime Git source for Python imports and use source: WORKSPACE instead
(you can also remove git_source entirely and let the bundle deploy the code into the workspace; that works fine).
If you have a mono repo, you can use sync.paths so the shared code is deployed together with the bundle.
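A minimal sketch of that (the folder names are hypothetical; adjust them to your repo layout, assuming the bundle lives in a subfolder of the mono repo):

```yaml
# databricks.yml fragment (hypothetical layout): pull sibling folders from
# the mono repo root into the bundle deployment so they land in the
# workspace next to the tasks.
sync:
  paths:
    - ../lib        # shared modules, deployed together with the bundle
    - ../notebooks  # task notebooks
```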
5 hours ago
I think you can only observe this on serverless compute. The same DAB with a Git source setup has been stable on job clusters for over a year, so my understanding is that the issue lies in how the Git repo root is added to the Python import path. As a workaround, resolve the repo root from Path.cwd() and add it to sys.path at the start of the notebook, instead of hardcoding a /Workspace/... path or moving everything to workspace source.
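A minimal sketch of that workaround, assuming the shared folder is literally named lib (the marker directory name and the function are my own illustration, not a Databricks API):

```python
# Hypothetical first cell of the notebook: walk up from the working
# directory until we find the folder that contains `lib`, then prepend
# it to sys.path so `import lib` resolves no matter where the Git
# checkout landed on this run.
import sys
from pathlib import Path

def ensure_repo_root_on_path(marker: str = "lib") -> Path:
    """Find the nearest ancestor of cwd containing `marker`, put it on
    sys.path (idempotently), and return it; raise if none qualifies."""
    here = Path.cwd().resolve()
    for candidate in (here, *here.parents):
        if (candidate / marker).is_dir():
            if str(candidate) not in sys.path:
                sys.path.insert(0, str(candidate))
            return candidate
    raise ModuleNotFoundError(f"no '{marker}' directory found above {here}")
```

Call ensure_repo_root_on_path() before the first import lib; since it skips paths that are already present, a retry or a warm cluster does not stack duplicate sys.path entries.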