Welcome to Part 2 of this series. If you haven’t read Part 1, we recommend you start there. To recap so far:
Between us, we have worked on over 30 customer migrations, which led us to develop this list of high-impact blunders that everyone can avoid:
And now, on with the show…
Code is a big piece of the puzzle, yes, but it’s not the only one. There are other data assets to consider, as well as downstream impacts.
Downstream systems are incredibly important if you don't want to upset your end users, and doubly so if you're hoping to decommission something. Here are some common ones we've seen:
This list is not exhaustive, and relates to #2 in Part 1, Not speaking to all your user groups.
There are also non-data assets closer to home. Some of these are binary considerations (do we move them or not?), whereas some are more open-ended.
In the binary yes/no considerations we have:
For the more open-ended activities we have:
Immediate action: Spotted a few gaps in the plans? Now is the time to add them, or at least mark them as risks.
Future action: The more open-ended activities will need refinement over time as the usage of any platform changes. Start planning time now to incorporate new features. Databricks moves fast, so this should ideally happen every 6-9 months. Please don't leave it for more than a year!
If we had a penny for every story of a team who realised they had to destroy their Databricks workspace to change their VNETs, it wouldn’t be much of a bonus, but it would be enough to share a 10p Freddo bar.
Fixing security is much harder to do retrospectively.
Each of these topics can be a fractal. This blog isn't a replacement for the documentation; rather, it aims to raise awareness of what needs to be considered.
Isolation of data and assets - Isolation has varying levels of severity to it. One IT team might serve two distinct legal entities with different regulations. Imagine a company working in both Swiss markets and Chinese markets; this would potentially require the most extreme levels of isolation, down to separate cloud accounts. A much softer version would be two teams that don't often overlap but occasionally need to collaborate. These teams could share a metastore but use separate workspaces, with only certain catalogs accessible to both.
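To make the softer version concrete, here is a minimal sketch with hypothetical catalog and group names: two teams share a metastore, each keeps a private catalog, and one collaboration catalog is granted to both. It assumes a notebook in a Unity Catalog enabled workspace, where `spark` is predefined.

```python
# Hypothetical setup: a shared collaboration catalog both groups can read.
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG shared_reference TO `team_alpha`")
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG shared_reference TO `team_beta`")

# Each team's private catalog is granted only to its own group.
spark.sql("GRANT ALL PRIVILEGES ON CATALOG team_alpha_workarea TO `team_alpha`")
spark.sql("GRANT ALL PRIVILEGES ON CATALOG team_beta_workarea TO `team_beta`")
```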
Who’s an admin - Databricks has admins for the account, the workspace, the marketplace and the metastores, all of whom can be different people. Similar to admins, there are owners for database items like catalogs, schemas, tables, models etc. Then there are workspace items like jobs, dashboards and notebooks.
Whilst we can tell you that making everyone an admin or owner is a bad idea, so is having so few that holiday season grinds progress to a halt. The sweet spot is down to you to decide.
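One way to soften the holiday-season problem is to assign ownership to a group rather than an individual. A minimal sketch, assuming a group named `platform-admins` already exists (catalog, schema and table names are hypothetical):

```python
# Group ownership means changes aren't blocked when one person is away.
spark.sql("ALTER CATALOG finance OWNER TO `platform-admins`")
spark.sql("ALTER SCHEMA finance.reporting OWNER TO `platform-admins`")
spark.sql("ALTER TABLE finance.reporting.monthly_summary OWNER TO `platform-admins`")
```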
Mapping Users, Groups & Principals to workspace objects - tedious, but necessary. Similar to the above with owners, but broader: who can use which clusters in which ways, plus managing jobs and workflows, alerts, secret scopes, models, dashboards, folders and notebooks …the list goes on.
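This mapping is far less tedious when scripted. Here's a sketch using the Databricks Python SDK to set cluster permissions for a group; the group name and cluster ID are hypothetical, and the same `permissions` API covers jobs, dashboards and other workspace objects.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()  # picks up auth from environment variables or a config profile

# Let the data-engineers group attach to and restart one shared cluster.
w.permissions.set(
    request_object_type="clusters",
    request_object_id="1234-567890-abcde123",  # hypothetical cluster ID
    access_control_list=[
        iam.AccessControlRequest(
            group_name="data-engineers",
            permission_level=iam.PermissionLevel.CAN_RESTART,
        )
    ],
)
```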
Onboarding or offboarding - nothing will dampen end user enthusiasm like a 10 working day wait to get access to a platform. Except perhaps a cumbersome authentication process. At the time of writing, Single Sign-On is required for Unity Catalog enabled workspaces. Either way, SCIM also needs to be set up, although we are looking to simplify this in future with new features. You can find the latest guidance in the documentation here.
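As a small illustration, here's a sketch of provisioning a user with the Databricks Python SDK (the user details are hypothetical). In practice most teams drive this from their identity provider via SCIM rather than ad-hoc scripts, but even a script beats a 10-working-day ticket queue.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth from environment variables or a config profile

# Create the user; group memberships would normally flow from the IdP via SCIM.
user = w.users.create(
    user_name="new.joiner@example.com",
    display_name="New Joiner",
)
print(f"Provisioned {user.user_name} with id {user.id}")
```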
Data Access - Not just at the high level of catalogs and schemas, but down to the data inside the tables themselves, like PII. Some of this is quick to implement with features like masking, but it becomes more complex (and time consuming) with tokenisation strategies.
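Masking is the quick end of that spectrum. A minimal sketch of a Unity Catalog column mask, with hypothetical function, table and group names: members of `pii_readers` see real emails, everyone else sees a redacted value.

```python
# Define the masking function once, then attach it to the column.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.default.hide_email(email STRING)
    RETURNS STRING
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***REDACTED***'
    END
""")
spark.sql("ALTER TABLE main.default.customers ALTER COLUMN email SET MASK main.default.hide_email")
```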
Immediate action: Walk through access journeys, either hypothetically or with users, and map them to the access they'd need. Now for the fun counter-argument: think of the worst-case data breach scenarios and work backwards. People are much better at describing what they don't want to happen than what they do.
Future action: The good news is that there’s not much here except keeping an eye out for new security features that might make people’s lives easier.
This is much harder to quantify as a checklist, but we'd categorise it as "missing the wood for the trees". Here are some examples we've experienced:
The original ask: Migrate three identical pipelines, one for each of three separate teams
The solution: Run one pipeline and grant access to all three teams
The original ask: Migrate to new dashboards …that the end users exported to build their own custom Excel files
The solution: Write the results directly to Excel
The original ask: Make the new pipelines faster with bigger clusters
The solution: Remove 95% of the redundancy by switching to incremental processing (a sketch follows after these examples)
The original ask: Optimize the longest running pipelines to save money
The solution: Optimize the most expensive pipelines to save money
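For the incremental-processing example above, here's a minimal sketch (table and column names hypothetical): rather than recomputing everything on every run, pick up only rows that arrived since the last run and MERGE them into the target Delta table.

```python
from delta.tables import DeltaTable

# High-water mark from the target table; falls back to the epoch on the first run.
last_seen = spark.sql(
    "SELECT coalesce(MAX(ingested_at), timestamp('1970-01-01')) FROM sales_silver"
).first()[0]

# Only process rows newer than the last run.
new_rows = spark.table("sales_bronze").where(f"ingested_at > '{last_seen}'")

# Upsert the new rows instead of rewriting the whole table.
(DeltaTable.forName(spark, "sales_silver").alias("t")
    .merge(new_rows.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```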
Wherever you're reading this blog, please add your own story in the comments section, in the hope that it helps others spot where they might be going wrong.
It would be naive to advise “don’t do this” because it overlooks the unique position we as Databricks find ourselves in. We can ask awkward questions because we’re not beholden to a hierarchy or politics and are reporting to the contract signer who tends to be removed from the detail.
Money.
In order to save money on their migrations, people look to cut corners; but this creates a false economy. “Buy cheap, buy twice” doesn’t just apply to fast fashion, it applies to platform migrations.
A further aspect of this is the budget put into ongoing maintenance. If you wanted your expensive shoes to last longer, you'd take regular care of them. Again, the same is true for any platform, not just Databricks. We know this can be a hard sell when the overall goal might be cost savings, but dedicating a few months a year to improvements is vastly cheaper than a wholesale migration.
If you're having a hard time articulating this number, it's easier to sell with long-run platform longevity estimates.
| Platform life expectancy | 3 years | 6 years |
| --- | --- | --- |
| Migration cost | £1m | £1m |
| Averaged migration cost per year of life | £333k | £167k |
| Maintenance months per year | 0 | 2 |
| Maintenance costs @ £50k per month | £0 | £100k |
| Total annual cost | £333k | £267k |
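If it helps to sanity-check the table, the arithmetic is just migration cost amortised over platform life plus annual maintenance (figures in pounds):

```python
# Annualised cost = migration cost spread over the platform's life,
# plus maintenance months per year at the monthly rate.
def annual_cost(migration_cost, life_years, maint_months_per_year, monthly_maint):
    return migration_cost / life_years + maint_months_per_year * monthly_maint

print(annual_cost(1_000_000, 3, 0, 50_000))  # 333,333 -> ~£333k per year
print(annual_cost(1_000_000, 6, 2, 50_000))  # 266,667 -> ~£267k per year
```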
Migration costs are predominantly made up of people time. Not just the small army of developers, but project managers, change management, enablement and onboarding teams.
The cheapest migration is the one that’s not needed.
Well done for making it to the end of this blog, we appreciate it’s not a light read. Here are summaries of all the actions we recommend you take in the short and long term.
Immediate actions:
Future plans: schedule time to regularly revisit:
We hope this has given you a few ideas for your migrations. Catching things early will increase the odds of success, and hopefully give your career a boost at the same time. If you’ve incorporated a few things into your plans, let us know!
Some of you reading this may be overwhelmed by a sinking feeling as you realise how many gaps there might be in your current plans. First of all, take a deep breath; it's a good thing to learn this now instead of a week before the deadline. Second, help is out there. If you'd like help from a third party who has been through it before, Databricks has many consulting partners as well as its own consulting services.