Databricks Community

sandy311 · 09-06-2024

When I deploy multiple jobs from the `databricks.yml` file with Databricks Asset Bundles, the deployment either overwrites the same job or renames it instead of creating separate, distinct jobs.

sandeepsharma

filipniziol · 09-07-2024

Hi @sandy311,
Could you share your databricks.yml file?

Are you sure you used unique job ids when defining your jobs?

sandy311 · 09-08-2024

My issue is that when I update a job with a different name, the deployment only overwrites the existing job instead of creating a new one.

Example:

Step 1:

This is my YAML file, and when I deploy it using asset bundles, it creates a job:

bundle:
  name: test-bundle

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  jobs:
    wheel-job:
      job_clusters:
        - job_cluster_key: sample-cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS4_v2
            driver_node_type_id: Standard_DS4_v2

Step 2:

When I update the job name or bundle name, it still updates the same job by changing its name. However, what I want is to create a new job if a new job or bundle name is provided, instead of overriding the existing one.

filipniziol · 09-08-2024

Hi @sandy311,

The behavior you're observing with Databricks asset bundles is expected, because asset bundles are designed to update existing jobs when the configuration or content changes. When you reuse a job id such as wheel-job, the asset bundle identifies it as the same job and updates it.

Key Points:

- Asset Bundle Purpose: The primary function of Databricks asset bundles is to manage job configurations consistently. When a job with the same id already exists in the deployment, the asset bundle updates that job rather than creating a new one.
- Job Identification: Jobs are identified by their ids, meaning the key under resources.jobs (wheel-job in your file) rather than the display name. If you reuse the same key, the asset bundle treats it as the same job and overwrites it.

Solution:

If you want to create a new job without overwriting the existing one, define distinct job ids (keys) in your databricks.yml file. For example, to retain the old job while creating a new one, you could use the keys wheel-job and wheel-job-v2 (or, better, more meaningful names).

Here's an updated example of how you can define multiple jobs in your databricks.yml file:

bundle:
  name: test-bundle

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  jobs:
    # Define the original job
    wheel-job:
      job_clusters:
        - job_cluster_key: sample-cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS4_v2
            driver_node_type_id: Standard_DS4_v2

    # Define a new job with a different name to avoid overwriting
    wheel-job-v2:
      job_clusters:
        - job_cluster_key: sample-cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS4_v2
            driver_node_type_id: Standard_DS4_v2

Explanation:

- By defining wheel-job and wheel-job-v2 as separate jobs, the asset bundle will create both jobs independently without one overwriting the other.
- If you need to update an existing job, keep the id consistent; if you need a new job, define a new unique id for it.

This approach will allow you to keep your old jobs while also creating new ones as required, using the asset bundles efficiently without conflicts.
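
To make the difference between the job id and the job name concrete, here is a minimal sketch. It assumes the bundle tracks each job by its key under resources.jobs and treats `name` only as the display name shown in the Jobs UI. The `name` value, task key, package name, and entry point are hypothetical and were not part of the original file.

```yaml
resources:
  jobs:
    # Resource key (the "id"): the bundle uses this to decide whether to
    # update an existing job or create a new one.
    wheel-job:
      # Display name only: changing it renames the deployed job in place.
      name: wheel-job-stable              # hypothetical display name
      tasks:
        - task_key: main                  # hypothetical task
          job_cluster_key: sample-cluster
          python_wheel_task:
            package_name: my_package      # hypothetical package name
            entry_point: main             # hypothetical entry point
          libraries:
            - whl: ./dist/*.whl
      job_clusters:
        - job_cluster_key: sample-cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS4_v2
            num_workers: 1
```

As far as I know, removing a key from the file causes the next deploy to delete the corresponding job, so both keys need to stay in databricks.yml if both jobs should keep existing.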

sandy311 · 09-08-2024

I was expecting the same behavior, and I have already tried the scenario you mentioned. I believe it's due to the asset bundle, and that's acceptable.

I'm trying to run integration tests when a pull request is created. The goal is to run the entire job before merging and deploying, to ensure the job works correctly with the new code. I was also attempting to parameterize this process by passing variables when the PR is created so that the integration job would run. After the merge, the new code would execute. However, I think we cannot parameterize the databricks.yml file, and this presents a challenge.

Any suggestions or best practices from your side?

filipniziol · 09-08-2024

Hi @sandy311,

Testing on production is generally not recommended. The ideal approach is to have separate environments, such as Dev, PreProd, and Prod, which allow for thorough testing before any changes are deployed to production.

Assuming you are deploying your pull request to a "target" environment (which could be production or another environment), here are two strategies you can use:

Strategy 1: Use a "Pre-Target" Environment for Testing

1. Create a Pre-Target Environment: Set up a testing environment (PreProd) that closely mirrors your target environment (Prod or another critical environment).
2. Deploy the Pull Request to Pre-Target: Deploy changes from the pull request to the Pre-Target environment.
3. Run Tests: Execute your job tests in the Pre-Target environment to ensure that the changes work as expected.
4. Deploy to Target if Tests Pass: If the tests are successful, proceed to deploy the changes to the target environment.
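
A minimal sketch of how the two environments could be expressed in a single databricks.yml, assuming bundle targets and custom variables are used for this. The target names, workspace URLs, and the run_label variable are hypothetical and need to be replaced with your own values.

```yaml
bundle:
  name: test-bundle

variables:
  run_label:                        # hypothetical variable set by CI
    description: Label for the deployment, for example a PR number
    default: local

targets:
  preprod:                          # the "Pre-Target" environment
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net   # hypothetical URL
  prod:                             # the target environment
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net   # hypothetical URL

resources:
  jobs:
    wheel-job:
      # Substitutions make the same definition distinguishable per
      # target and per CI run.
      name: wheel-job-${bundle.target}-${var.run_label}
```

With a layout like this, `databricks bundle deploy -t preprod --var="run_label=pr-123"` followed by `databricks bundle run -t preprod wheel-job` would cover steps 2 and 3, and `databricks bundle deploy -t prod` would cover step 4. The `--var` flag is also one way to pass values into the bundle from a pull request pipeline.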

Strategy 2: Use a Rollback Mechanism in the Target Environment

1. Deploy to Target: Deploy the changes directly to the target environment.
2. Run Tests on Target: Execute job tests in the target environment.
3. Handle Results:
- If Tests Pass: Keep the deployed changes.
- If Tests Fail: Roll back to the last known good configuration (e.g., main branch, previous release).
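
The steps above can be sketched as a CI workflow. This is only an illustration under several assumptions: GitHub Actions as the CI system, a target named dev in databricks.yml, a job key wheel-job, main as the last known good branch, and repository secrets named DATABRICKS_HOST and DATABRICKS_TOKEN. None of these come from the original setup.

```yaml
# .github/workflows/pr-integration.yml (hypothetical file name)
name: pr-integration-test

on:
  pull_request:
    branches: [main]

env:
  # Hypothetical secret names; the Databricks CLI reads these variables.
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  deploy-and-test:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the pull request code
        uses: actions/checkout@v4

      - name: Install the Databricks CLI
        uses: databricks/setup-cli@main

      - name: Install Poetry for the wheel build
        run: pip install poetry

      # Step 1: deploy the changes to the target.
      - name: Deploy the pull request
        run: databricks bundle deploy -t dev

      # Step 2: run the job itself as the integration test.
      - name: Run the job
        run: databricks bundle run -t dev wheel-job

      # Step 3: on failure, redeploy the last known good configuration.
      - name: Check out main
        if: failure()
        uses: actions/checkout@v4
        with:
          ref: main

      - name: Roll back by redeploying main
        if: failure()
        run: databricks bundle deploy -t dev
```

The rollback here is simply a second deploy from main, so it relies on the bundle updating the same job in place, and on `databricks bundle run` failing the step when the job run fails.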

These strategies help maintain stability in your production-like environment while ensuring your new code is tested thoroughly before any critical deployment.

In summary, you do not want to keep multiple versions of the same job (stable-old and untested-new) in the same environment. The best practice here is to have a separate environment to test your pull request.

sandy311 · 09-08-2024

@filipniziol Thank you for your valuable feedback. I believe the second approach aligns well with my requirements. I will implement a rollback mechanism in the target environment. Currently, I am performing all these tasks in the development environment.

Ncolin1999 · 10-23-2024

@filipniziol My requirement is just to deploy notebooks to a Databricks workspace. I don't want to create any jobs. Can I still use Databricks asset bundles?