Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DAB git - sometimes doesn't see modules

pepco
New Contributor III

We are using DABs to deploy our jobs. The DABs have their source set to a git branch or a git tag, depending on the environment. The repository is structured as a monorepo. We don't use wheels for our modules. Sometimes the jobs "randomly" fail because a module is not found, e.g. "ModuleNotFoundError: No module named 'lib'". A restart then runs without any issues.

I'm trying to understand what's happening, but it looks like PYTHONPATH is sometimes not set correctly. 

Did anyone see this behavior?
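To compare a failing run with a healthy one, something like this sketch (names are mine, not from any Databricks API) could be added as the first cell of the notebook to capture the state that decides whether `import lib` works:

```python
import os
import sys

def import_env_snapshot() -> dict:
    """Collect the pieces of interpreter state that decide whether a
    bare `import lib` can succeed: working directory, PYTHONPATH, and
    the full module search path."""
    return {
        "cwd": os.getcwd(),
        "PYTHONPATH": os.environ.get("PYTHONPATH", "<not set>"),
        "sys_path": list(sys.path),
    }

# Print the snapshot so it lands in the run output for later comparison.
for key, value in import_env_snapshot().items():
    print(f"{key}: {value}")
```

Diffing the printed `sys_path` between a failed run and a successful restart should show whether the repo root is the entry that comes and goes.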

7 REPLIES

Sumit_7
Honored Contributor III

@pepco Would you mind sharing your DAB yaml (hiding secrets)?

pepco
New Contributor III

job.yml

resources:
  jobs:
    pdm_general_ledger_details_hub_job:
      name: org_team_pdm_general_ledger_details_hub_job
      description: "Load General Ledger Details reference hub table"
      email_notifications:
        on_failure:
          - dl@redacted.com
      performance_target: PERFORMANCE_OPTIMIZED
      tasks:
        - task_key: load_ref_pah_general_ledger_details_hub
          timeout_seconds: "${var.timeout}"
          notebook_task:
            notebook_path: notebooks/run_query
            base_parameters:
              pipeline_name: ref_pah_general_ledger_details_hub
              target_table: schema.table
              query_file: ../pipelines/pdm_general_ledger_details_hub/src/load_ref_pah_general_ledger_details_hub.sql
            source: GIT
      git_source:
        git_url: https://github.com/redacted/redacted.git
        git_provider: gitHub
      tags:
        app-ci-id: ${var.configuration_item}
        job-type: child

databricks.yml

bundle:
  name: pdm_ref_general_ledger_details

# These are any additional configuration files to include.
include:
  - bundle/jobs/*.yml
  - bundle/variables/*.yml

non_production_job_permissions: &non_prod_job_permissions
  permissions:
    - level: CAN_MANAGE
      group_name: redacted
    - level: CAN_MANAGE
      service_principal_name: ${var.service_account}

production_job_permissions: &prod_job_permissions
  permissions:
    - level: CAN_MANAGE_RUN
      group_name: redacted
    - level: CAN_MANAGE
      service_principal_name: ${var.service_account}
    - level: CAN_MANAGE
      service_principal_name: ${var.snow_service_account}

non_production_job_notifications: &non_prod_job_notifications
  email_notifications:
    on_failure:
      - dl@redacted

production_job_notifications: &prod_job_notifications
  email_notifications:
    on_failure:
      - dl@redacted
  webhook_notifications:
    on_failure:
      - id: ${var.snow_webhook_id}

targets:
  test:
    mode: production
    default: false
    presets:
      trigger_pause_status: UNPAUSED
      jobs_max_concurrent_runs: 1
    workspace:
      host: https://redacted.cloud.databricks.com
      root_path: /Workspace/org/team/.bundle/${bundle.target}/${var.developer_id}/${bundle.name}
    resources:
      jobs:
        pdm_general_ledger_details_hub_job:
          git_source:
            git_branch: ${var.git_branch}
          <<:
            - *non_prod_job_permissions
            - *non_prod_job_notifications

variables:
  uc_catalog:
    description: Unity Catalog prefix.
    default: "tst"
  configuration_item:
    default: redacted

amirabedhiafi
New Contributor II

Hi again @pepco, as I explained in my answer below, the issue is probably caused by using source: GIT / git_source together with DABs. DBKS does not recommend this pattern for bundles, because the job runs from Git at runtime instead of from the workspace files deployed by the bundle.

In a mono repo, this can make relative imports like lib unreliable. You should remove git_source and source: GIT, deploy the code with the bundle, use workspace source paths, and include shared folders through sync.paths. Also don't forget to make the repo root explicit in sys.path or package the shared code as a wheel.
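Making the repo root explicit could look like the sketch below. The REPO_ROOT value is a hypothetical example, not your actual deployment path; it would be whatever absolute workspace path your bundle deploys the monorepo to.

```python
import sys

# Hypothetical example path; substitute the workspace location your
# bundle actually deploys to (see root_path in databricks.yml).
REPO_ROOT = "/Workspace/org/team/.bundle/test/dev1/my_bundle/files"

# Prepend the repo root so shared packages resolve regardless of what
# the task runner happens to put on the import path.
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

# After this, `import lib` resolves against REPO_ROOT/lib.
```
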

If this answer resolves your question, could you please mark it as "Accept as Solution"? It will help other users quickly find the correct fix.

Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP

pepco
New Contributor III

I'm sorry, I'm not ready to accept this as a solution. I'm not saying you're wrong, though. The documentation is not clear on this, or I would say it even contradicts itself.

Add tasks to jobs in Declarative Automation Bundles | Databricks on AWS:

"..., because local relative paths may not point to the same content in the Git repository."

I'm not using relative imports to import my shared modules.

"Instead, clone the repository locally and set up your bundle project within this repository, so that the source for tasks are the workspace."

It was my understanding that when a job starts, it clones the repository locally to the cluster, and therefore it should behave correctly: Use Git with Lakeflow Jobs | Databricks on AWS

amirabedhiafi
New Contributor II

Hello @pepco !

I will share my personal experience with a very similar behaviour to yours.

If you check the DBKS docs, you will find that git_source and task source: GIT are not recommended for DABs, because local relative paths may not point to the same content in the git repo, and bundles expect the deployed job to run from the same files that were deployed from the local bundle copy.

You need to use workspace source for bundle tasks instead. https://docs.databricks.com/aws/en/dev-tools/bundles/job-task-types

In my case, this is what I had :

mono repo
DAB deployment
source = Git branch/tag
custom local modules
no wheels
imports like: import lib

I understood at that time that this combination can work most of the time, but it depends heavily on what DBKS puts into the cwd or sys.path for that specific task run.

So I took the time to understand what was happening behind the scenes. In reality, when a task uses git source, DBKS retrieves the notebook or Python file from the git repo at runtime. For Python script tasks, DBKS says that workspace paths must be absolute while git paths are relative, so if source is empty the task uses GIT whenever git_source is defined.

My task file was found, but my shared lib folder was not consistently on sys.path, so Python was failing with the famous "No module named 'lib'" error.

A retry can succeed because the run may start in a slightly different initialized state or because the cluster already has a path state from a previous successful run. That does not mean the setup is deterministic.

What I have done so far: I stopped using runtime git source for Python imports and switched to source: WORKSPACE.

(You can also remove git_source entirely and let the bundle deploy the code into the workspace; that works fine.)

If you have a mono repo, you can use sync.paths so the shared code is deployed together with the bundle.


pepco
New Contributor III
I'm aware of the description in the documentation. I observed this problem only with serverless compute. With job clusters it never failed in the 16+ months we have used bundles with git.

If I move to the workspace, then I would need to add the workspace path to the Python path (according to all the posts on the forum), which brings other problems to the table.

amirabedhiafi
New Contributor II

I think you can indeed only observe this on serverless compute. Since the same DAB with the git source setup has been stable on job clusters for over a year, my understanding is that the issue is in how the git repo root is added to the Python import path. As a workaround, resolve the repo root from Path.cwd() and add it to sys.path at the start of the notebook, instead of hardcoding a /Workspace/... path or moving everything to workspace source.
