
Schema migration of production delta tables

BjarkeM
New Contributor II

Goal

We would like to be in control of schema migrations of delta tables in all dev and production environments, and the migrations must be deployed automatically.

I anticipated this to be a common problem with a well-known standard solution. But unfortunately, I haven't come across a good solution yet.

Current Solution

In our project we have created a custom tool/concept where SQL scripts are named in a way that ensures they are executed in the correct order. In each environment, we store a file that keeps track of which migrations (scripts) have been executed. During CD deployment, every migration that hasn't previously been executed is run.
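
To make the concept concrete, here is a minimal sketch of the runner idea (the paths, names, and state-file format below are simplified placeholders, not our actual tool):

# Sketch of the migration-runner concept (all paths and names are placeholders).
# SQL scripts carry a sortable prefix, e.g. V001__create_orders.sql, and a small
# state file records which scripts have already been applied in this environment.
import json
from pathlib import Path

MIGRATIONS_DIR = Path("migrations")                      # scripts checked into the repo
STATE_FILE = Path("/dbfs/migration_state/applied.json")  # per-environment state

applied = set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()

for script in sorted(MIGRATIONS_DIR.glob("*.sql")):      # name ordering = execution order
    if script.name in applied:
        continue
    for statement in script.read_text().split(";"):
        if statement.strip():
            spark.sql(statement)                         # `spark` is the ambient session in a Databricks job
    applied.add(script.name)
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(sorted(applied)))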

The Problem

1. The first problem is that we'd like a standard solution.

2. The second problem is that the metadata is lost when the Databricks workspace is recreated (for any reason). This is a problem until we eventually switch from the hive metastore to Unity Catalog. In that situation, a migration that renames a column using the column mapping feature will be registered as executed, but the column name change itself is lost.
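
For concreteness, the kind of migration I mean looks roughly like this (table and column names are made up):

# Illustrative rename migration (made-up table/column names). Column mapping
# must be enabled before RENAME COLUMN is allowed on a Delta table. If this
# change is lost when the workspace is recreated, our tracking file still
# marks the migration as applied, so it will never be re-run.
spark.sql("""
    ALTER TABLE sales.orders SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion'   = '2',
        'delta.minWriterVersion'   = '5'
    )
""")
spark.sql("ALTER TABLE sales.orders RENAME COLUMN cust_id TO customer_id")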

What are best practices or recommendations for these problems?


RyanD-AgCountry
Contributor

Have you explored using Terraform to help manage deployments? I believe that may be what you're looking for. We use environment variables in project folders to drive which parameters are used, including the pipeline/DLT schemas.

BjarkeM
New Contributor II

Thanks @RyanD-AgCountry. We use Terraform. Can you elaborate on your solution?

RyanD-AgCountry
Contributor

Sure, I'll do my best! While I'm unable to share exact internal code, I can reference boilerplate info.

The databricks_pipeline resource in the Terraform provider has a target parameter, where you can point the pipeline to the intended catalog/schema location. Our .tf file for pipelines uses something along the lines of this:

 

resource "databricks_pipeline" "pipeline" {
    provider = databricks.workspace
    for_each = var.pipelines
    name = each.key

    development = var.dlt_development
    continuous = false
    channel = "PREVIEW"
    photon = true
    edition = "ADVANCED"
    catalog = "${var.env}_${each.value["catalog"]}"
    target = each.value["target"]

    configuration = {
        "pipelines.clusterShutdown.delay" = "10m"
    }
    cluster {
        label = "default"
        autoscale {
            min_workers = each.value["minWorkers"]
            max_workers = each.value["maxWorkers"]
            mode = "ENHANCED"
        }
    }
    library {
        notebook {
            path = each.value["notebookpath"]
        }
    }
}

Then, when our DevOps CI/CD YAML script runs, we call the resulting .tfvars files to pull in variables and parameters. In the case of pipelines, we use notation similar to this:

pipelines = {
    "pipeline1" = {
        catalog = "bronze"
        target = "database1"
        notebookpath = "/Repos/Engineering/project/notebook..."
    }
}

 

BjarkeM
New Contributor II

Thanks again @RyanD-AgCountry. However, I don't understand how this solves our problem. I'm talking about the case where, say, a delta table schema changes and is migrated in CD together with the deployment of the updated code that works on the table with the new schema.

And since we don't yet use UC, the delta table metadata (like the column name mappings executed as part of the migrations) is lost when/if the Databricks workspace is re-established.

RyanD-AgCountry
Contributor

I apologize, I misunderstood the problem. I don't believe there is any legacy hive_metastore support for migrating schemas between environments at the workspace level.

I do believe that Unity Catalog combined with Delta Live Tables, schema evolution, and Terraform may help once you get there, or at least help you rebase the approach you are taking.
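
As a rough illustration of the schema evolution piece (the table and DataFrame names are placeholders):

# Rough illustration of Delta schema evolution on write (placeholder names).
# With mergeSchema enabled, new columns present in the incoming DataFrame are
# added to the target table's schema instead of failing the append.
(new_data_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("dev_bronze.database1.events"))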

zerobugs
New Contributor II

Hello, so does this mean that it's necessary to migrate away from hive_metastore to unity_catalog in order to be able to use schema migrations?

MikaelB
New Contributor II

Hello,

Same question here, and I couldn't find much content about how to handle the deployment of schema changes.

I'm relatively new to Databricks, so there may be some obvious functionality I'm missing, but let's say I need to deploy a schema change like adding columns to a table. I can add the code in dev, but it only needs to run once for the schema change, not every time the job runs, and I don't want to handle that by running the command manually in prod or by managing errors or if statements in the job.
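
To be concrete, the kind of in-job guard I'd rather not maintain looks something like this (made-up table and column names):

# The kind of schema-check guard I'd like to avoid putting in the job itself
# (made-up names): inspect the current schema on every run and only alter the
# table the first time.
if "discount_code" not in spark.table("prod_db.orders").columns:
    spark.sql("ALTER TABLE prod_db.orders ADD COLUMNS (discount_code STRING)")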

What are the best practices for that? Having a specific notebook, or a specific job in a repo, that is not scheduled and only runs once?

Thanks!

zerobugs
New Contributor II

hey everyone!
coming back with an update: we ended up migrating all our previous SQL scripts to a dbt project (https://www.getdbt.com/) - the reason for this was explicitly to make it easy to handle schema migrations in a hive metastore (we can't/won't migrate to Unity Catalog on short notice).

What sold me on dbt (the solution was proposed by someone smarter than me :)) was the fact that you can basically instruct it to handle migrations itself with a one-liner:

 

+on_schema_change: "sync_all_columns"
Hope this helps & good luck!
