Schema migration of production delta tables

BjarkeM
New Contributor II

Goal

We would like to be in control of schema migrations of Delta tables in all dev and production environments, and the migrations must be deployed automatically.

I anticipated this would be a common problem with a well-known, standard solution, but unfortunately I haven't come across a good one yet.

Current Solution

In our project we have created a custom tool/concept where SQL scripts are named in a way that ensures they are executed in the correct order. In each environment, we store a file that keeps track of which migrations (scripts) have been executed. During deployment in CD, each migration that hasn't previously been run is executed.
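For illustration, the concept behaves roughly like the sketch below (the directory, state-file path, and function names are placeholders, not our actual tool; spark is the SparkSession Databricks provides):

from pathlib import Path

# Placeholder locations: migration scripts named 001_create_table.sql, 002_add_column.sql, ...
MIGRATIONS_DIR = Path("/Workspace/migrations")
# Per-environment file keeping track of which migrations have been executed
STATE_PATH = Path("/dbfs/migration_state/applied.txt")

def applied_migrations() -> set[str]:
    return set(STATE_PATH.read_text().splitlines()) if STATE_PATH.exists() else set()

def run_pending_migrations(spark) -> None:
    done = applied_migrations()
    # Lexicographic sort of the file names gives the intended execution order.
    for script in sorted(MIGRATIONS_DIR.glob("*.sql")):
        if script.name in done:
            continue
        # A naive split on ';' is enough for simple DDL scripts in this sketch.
        for statement in script.read_text().split(";"):
            if statement.strip():
                spark.sql(statement)
        # Record the migration so it is skipped on the next deployment.
        STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
        with STATE_PATH.open("a") as state:
            state.write(script.name + "\n")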

The Problem

1. The first problem is that we'd like a standard solution.

2. The second problem is that the metadata is lost when the Databricks workspace (for any reason) is recreated. This remains a problem until we eventually switch from the Hive metastore to Unity Catalog. In that situation, a migration that renames a column using the column mapping feature is registered as executed, but the column name change itself is lost (a concrete example follows below).
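To make the second problem concrete, such a migration looks roughly like this (table and column names are made up; shown as the spark.sql equivalent of the SQL script):

# Hypothetical migration 003_rename_cust_id.sql, shown as spark.sql calls.
# Column mapping by name must be enabled before a column can be renamed.
spark.sql("""
    ALTER TABLE bronze.customers SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion'   = '2',
        'delta.minWriterVersion'   = '5'
    )
""")

# The actual rename. If the environment is later recreated but our state file
# still marks this migration as executed, the rename is never re-applied.
spark.sql("ALTER TABLE bronze.customers RENAME COLUMN cust_id TO customer_id")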

What are best practices or recommendations for these problems?

6 REPLIES

RyanD-AgCountry
Contributor

Have you explored the use of Terraform to help manage deployments? I believe that may be what you're looking for. We use environment variables in project folders that drive which parameters are used, including the pipeline/DLT schemas.

BjarkeM
New Contributor II

Thanks @RyanD-AgCountry. We use Terraform. Can you elaborate on your solution?

RyanD-AgCountry
Contributor

Sure, I'll do my best! While I'm unable to share exact internal code, I can reference boilerplate info.

The databricks_pipeline Terraform resource has a target parameter, where you can point the pipeline to the intended catalog location. Our .tf file for pipelines uses something along the lines of this:

 

resource "databricks_pipeline" "pipeline" {
    provider = databricks.workspace
    for_each = var.pipelines
    name = each.key

    development = var.dlt_development
    continuous = false
    channel = "PREVIEW"
    photon = true
    edition = "ADVANCED"
    catalog = "${var.env}_${each.value["catalog"]}"
    target = each.value["target"]

    configuration = {
        "pipelines.clusterShutdown.delay" = "10m"
    }
    cluster {
        label = "default"
        autoscale {
            min_workers = each.value["minWorkers"]
            max_workers = each.value["maxWorkers"]
            mode = "ENHANCED"
        }
    }
    library {
        notebook {
            path = each.value["notebookpath"]
        }
    }
}

Then, when our DevOps CI/CD YAML pipeline runs, we pass in the corresponding .tfvars files to pull in variables and parameters. In the case of pipelines, the map in the .tfvars file looks similar to this:

pipelines = {
    "pipeline1" = {
        catalog = "bronze"
        target = "database1"
        notebookpath = "/Repos/Engineering/project/notebook..."
    }
}

 

BjarkeM
New Contributor II

Thanks again @RyanD-AgCountry. However, I don't understand how this solves our problem. I'm talking about when, say, a Delta table schema changes and the change is migrated in CD together with the deployment of the updated code that works on the table with the new schema.

And as we don't yet use UC, the Delta table metadata, such as column name mappings applied as part of the migrations, is lost when/if the Databricks workspace is re-established.

RyanD-AgCountry
Contributor

I apologize, I misunderstood the problem. I don't believe there is any legacy hive_metastore support for migrating schemas between environments at the workspace level.

I do believe that UC, combined with live tables, schema evolution, and Terraform, may help once you get there, or it may help rebase the approach you are taking.
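As a rough sketch of the schema-evolution piece (the table name and DataFrame here are placeholders, and spark is the Databricks-provided SparkSession):

# Delta schema evolution on write: new columns are merged into the target schema.
df_with_new_column = spark.createDataFrame(
    [(1, "Alice", "DK")], "customer_id INT, name STRING, country STRING"
)

(
    df_with_new_column.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # add any new columns to the table schema
    .saveAsTable("bronze.customers")
)

Note this only covers additive changes such as new columns; a rename would still need an explicit migration.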

zerobugs
New Contributor II

Hello, so does this mean that it's necessary to migrate away from hive_metastore to unity_catalog in order to be able to use schema migrations?
