12-18-2023 06:26 AM
Goal
We would like to be in control of schema migrations of Delta tables in all dev and production environments, and the migrations must be deployed automatically.
I expected this to be a common problem with a well-known standard solution, but unfortunately I haven't come across a good one yet.
Current Solution
In our project we have built a custom tool/concept where SQL migration scripts are named so that they are executed in the correct order. In each environment we store a file that keeps track of which migrations (scripts) have already been executed. During deployment in CD, every migration that hasn't been executed before is run.
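To illustrate the idea (the file, table, and column names below are made up, not our actual scripts):

-- 001_create_customer.sql  (scripts sort lexicographically, so they run in order)
CREATE TABLE IF NOT EXISTS bronze.customer (
  id   BIGINT,
  name STRING
) USING DELTA;

-- 002_add_email_column.sql  (skipped if already listed in the environment's tracking file)
ALTER TABLE bronze.customer ADD COLUMNS (email STRING);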
The Problem
1. The first problem is that we'd like a standard solution.
2. The second problem is that the metadata is lost if the Databricks workspace is recreated (for whatever reason). This remains a problem until we eventually switch from the Hive metastore to Unity Catalog. In that situation a migration that renames a column using the column mapping feature (example below) will be registered as executed, but the column name change itself is lost.
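For example, a migration along these lines (made-up names again) ends up recorded as executed even though the rename is gone:

-- 003_rename_name_to_full_name.sql
ALTER TABLE bronze.customer SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion'   = '2',
  'delta.minWriterVersion'   = '5'
);

ALTER TABLE bronze.customer RENAME COLUMN name TO full_name;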
What are best practices or recommendations for these problems?
12-18-2023 06:33 AM
Have you explored the use of Terraform to help manage deployments? I believe that may be what you're looking for. We use environment variables in project folders that drive which parameters are used, including the pipeline/DLT schemas.
12-18-2023 06:43 AM
Thanks @RyanD-AgCountry. We use Terraform. Can you elaborate on your solution?
12-18-2023 06:51 AM
Sure, I'll do my best! While I'm unable to share exact internal code, I can reference boilerplate info.
The databricks_pipeline resource in the Databricks Terraform provider has a target parameter where you can point the pipeline to the intended catalog/schema location. Our .tf file for pipelines uses something along the lines of this:
resource "databricks_pipeline" "pipeline" {
  provider    = databricks.workspace
  for_each    = var.pipelines
  name        = each.key
  development = var.dlt_development
  continuous  = false
  channel     = "PREVIEW"
  photon      = true
  edition     = "ADVANCED"
  catalog     = "${var.env}_${each.value["catalog"]}"
  target      = each.value["target"]

  configuration = {
    "pipelines.clusterShutdown.delay" = "10m"
  }

  cluster {
    label = "default"
    autoscale {
      min_workers = each.value["minWorkers"]
      max_workers = each.value["maxWorkers"]
      mode        = "ENHANCED"
    }
  }

  library {
    notebook {
      path = each.value["notebookpath"]
    }
  }
}
Then, when our DevOps CI/CD YAML script runs, it pulls in the corresponding .tfvars files for variables and parameters. In the case of pipelines, we have a definition similar to this:
pipelines = {
  "pipeline1" = {
    catalog      = "bronze"
    target       = "database1"
    notebookpath = "/Repos/Engineering/project/notebook..."
  }
}
12-19-2023 09:26 AM
Thanks again @RyanD-AgCountry. However, I don't understand how this solves our problem. I'm talking about the case where, say, a Delta table schema changes and is migrated in CD together with the deployment of the updated code that works on the table with the new schema.
And since we don't use UC yet, the Delta table metadata, like column name mappings applied as part of the migrations, is lost when/if the Databricks workspace is re-established.
12-19-2023 11:37 AM
I apologize, I misunderstood the problem. I don't believe the legacy hive_metastore has any support for migrating schemas between environments at the workspace level.
I do believe that UC, combined with Delta Live Tables, schema evolution, and Terraform, may help once you get there, or it may help rebase the approach you are taking.
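As a rough illustration of the schema evolution part (made-up table names, not our code), Delta can pick up new source columns during a MERGE when automatic schema merging is enabled:

-- Allow the MERGE to add columns that exist in the source but not yet in the target
SET spark.databricks.delta.schema.autoMerge.enabled = true;

MERGE INTO silver.customer AS t
USING staging.customer_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;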
03-15-2024 03:35 AM
Hello, so does this mean that it's necessary to migrate away from hive_metastore to unity_catalog in order to be able to use schema migrations?
07-29-2024 05:59 AM
Hello,
Same question here, and I couldn't find much content about how to handle the deployment of schema changes.
I'm relatively new to Databricks, so there may be some obvious functionality I'm missing, but let's say I need to deploy a schema change such as adding columns to a table (for instance, something like the statement below). I can add the code in dev, but it only needs to run once for the schema change, not every time the job runs, and I don't want to handle that by running the command manually in prod or by managing errors or if statements in the job.
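Just to make the example concrete (table and column names are made up):

-- One-off change that should run exactly once per environment
ALTER TABLE sales.orders ADD COLUMNS (discount_pct DOUBLE);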
What are the best practices for that? Having a dedicated script or job in a repo that is not scheduled but runs only once?
Thanks!
08-05-2024 07:37 AM
hey everyone!
coming back with an update: we ended up migrating all our previous SQL scripts to a dbt project (https://www.getdbt.com/). The reason for this was explicitly to make it easy to handle schema migrations in a Hive metastore (we can't/won't migrate to Unity Catalog on short notice).
What got me sold on dbt (the solution was proposed by someone smarter than me :)) was the fact that you can basically instruct it to handle migrations itself in a one-liner:
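In dbt that typically means the on_schema_change config on incremental models; the model and column names below are made up for illustration, not from our actual project:

-- models/silver/customer.sql
{{ config(
    materialized = 'incremental',
    unique_key = 'id',
    on_schema_change = 'sync_all_columns'
) }}

select id, name, email
from {{ source('bronze', 'customer') }}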