12-18-2023 06:26 AM
Goal
We would like to be in control of schema migrations of Delta tables in all dev and production environments, and the migrations must be deployed automatically.
I expected this to be a common problem with a well-known standard solution, but unfortunately I haven't come across a good one yet.
Current Solution
In our project we have built a custom tool/concept where SQL migration scripts are named so that they are executed in the correct order. In each environment we store a file that keeps track of which migrations (scripts) have already been executed. During deployment in CD, every migration that hasn't been executed before is run.
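To illustrate the idea (the file, table, and column names below are made up, not our actual scripts):

-- 001_create_customer.sql  (scripts sort lexicographically, so they run in order)
CREATE TABLE IF NOT EXISTS bronze.customer (
  id   BIGINT,
  name STRING
) USING DELTA;

-- 002_add_email_column.sql  (skipped if already listed in the environment's tracking file)
ALTER TABLE bronze.customer ADD COLUMNS (email STRING);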
The Problem
1. The first problem is that we'd like a standard solution.
2. The second problem is that the metadata is lost if the Databricks workspace is recreated (for whatever reason). This remains a problem until we eventually switch from the Hive metastore to Unity Catalog. In that situation a migration that renames a column using the column mapping feature (example below) will be registered as executed, but the column name change itself is lost.
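For example, a migration along these lines (made-up names again) ends up recorded as executed even though the rename is gone:

-- 003_rename_name_to_full_name.sql
ALTER TABLE bronze.customer SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion'   = '2',
  'delta.minWriterVersion'   = '5'
);

ALTER TABLE bronze.customer RENAME COLUMN name TO full_name;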
What are best practices or recommendations for these problems?
12-18-2023 06:33 AM
Have you explored the use of Terraform to help manage deployments? I believe that may be what you're looking for. We use environment variables in project folders that drive which parameters are used, including the pipeline/DLT schemas.
12-18-2023 06:43 AM
Thanks @RyanD-AgCountry. We use Terraform. Can you elaborate on your solution?
12-18-2023 06:51 AM
Sure, I'll do my best! While I'm unable to share exact internal code, I can reference boilerplate info.
The databricks_pipeline resource in the Databricks Terraform provider has a target parameter where you can point the pipeline to the intended catalog/schema location. Our .tf file for pipelines uses something along the lines of this:
resource "databricks_pipeline" "pipeline" {
  provider    = databricks.workspace
  for_each    = var.pipelines
  name        = each.key
  development = var.dlt_development
  continuous  = false
  channel     = "PREVIEW"
  photon      = true
  edition     = "ADVANCED"
  catalog     = "${var.env}_${each.value["catalog"]}"
  target      = each.value["target"]

  configuration = {
    "pipelines.clusterShutdown.delay" = "10m"
  }

  cluster {
    label = "default"
    autoscale {
      min_workers = each.value["minWorkers"]
      max_workers = each.value["maxWorkers"]
      mode        = "ENHANCED"
    }
  }

  library {
    notebook {
      path = each.value["notebookpath"]
    }
  }
}
Then, when our DevOps CI/CD YAML script runs, it pulls in the corresponding .tfvars files for variables and parameters. In the case of pipelines, we have a definition similar to this:
pipelines = {
  "pipeline1" = {
    catalog      = "bronze"
    target       = "database1"
    notebookpath = "/Repos/Engineering/project/notebook..."
  }
}
12-19-2023 09:26 AM
Thanks again @RyanD-AgCountry. However, I don't understand how this solves our problem. I'm talking about the case where, say, a Delta table schema changes and is migrated in CD together with the deployment of the updated code that works on the table with the new schema.
And since we don't use UC yet, the Delta table metadata, like column name mappings applied as part of the migrations, is lost when/if the Databricks workspace is re-established.
12-19-2023 11:37 AM
I apologize, I misunderstood the problem. I don't believe the legacy hive_metastore has any support for migrating schemas between environments at the workspace level.
I do believe that UC, combined with Delta Live Tables, schema evolution, and Terraform, may help once you get there, or it may help rebase the approach you are taking.
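As a rough illustration of the schema evolution part (made-up table names, not our code), Delta can pick up new source columns during a MERGE when automatic schema merging is enabled:

-- Allow the MERGE to add columns that exist in the source but not yet in the target
SET spark.databricks.delta.schema.autoMerge.enabled = true;

MERGE INTO silver.customer AS t
USING staging.customer_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;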
03-15-2024 03:35 AM
Hello, so does this mean that it's necessary to migrate away from hive_metastore to unity_catalog in order to be able to use schema migrations?
07-29-2024 05:59 AM
Hello,
Same question here, and I couldn't find much content about how to handle the deployment of schema changes.
I'm relatively new to Databricks, so there may be some obvious functionality I'm missing, but let's say I need to deploy a schema change such as adding columns to a table (for instance, something like the statement below). I can add the code in dev, but it only needs to run once for the schema change, not every time the job runs, and I don't want to handle that by running the command manually in prod or by managing errors or if statements in the job.
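Just to make the example concrete (table and column names are made up):

-- One-off change that should run exactly once per environment
ALTER TABLE sales.orders ADD COLUMNS (discount_pct DOUBLE);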
What are the best practices for that? Having a dedicated script or job in a repo that is not scheduled but runs only once?
Thanks!
08-05-2024 07:37 AM
hey everyone!
coming back with an update: we ended up migrating all our previous SQL scripts to a dbt project (https://www.getdbt.com/). The reason for this was explicitly to make it easy to handle schema migrations in a Hive metastore (we can't/won't migrate to Unity Catalog on short notice).
What got me sold on dbt (the solution was proposed by someone smarter than me :)) was the fact that you can basically instruct it to handle migrations itself in a one-liner:
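In dbt that typically means the on_schema_change config on incremental models; the model and column names below are made up for illustration, not from our actual project:

-- models/silver/customer.sql
{{ config(
    materialized = 'incremental',
    unique_key = 'id',
    on_schema_change = 'sync_all_columns'
) }}

select id, name, email
from {{ source('bronze', 'customer') }}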