Welcome to Part 2 of this series. If you haven’t read Part 1, we recommend you start there. To recap so far:
Between us, we have worked on over 30 customer migrations, which led us to develop this list of high-impact blunders that everyone can avoid:
And now, on with the show…
Code is a big piece of the puzzle, yes, but it’s not the only one. There are other data assets to consider, as well as downstream impacts.
Downstream systems are incredibly important if you don't want to upset your end users, and doubly so if you're hoping to decommission something. Here are some common ones we've seen:
This list is not exhaustive, and relates to #2 in Part 1, Not speaking to all your user groups.
There are also non-data assets closer to home. Some of these are binary considerations (do we move them or not?), whereas some are more open-ended.
In the binary yes/no considerations we have:
For the more open-ended activities we have:
Immediate action: Spotted a few gaps in the plans? Now is the time to add them, or at least mark them as risks.
Future action: The more open-ended activities will need refinement over time as the usage of any platform changes. Start planning time now to incorporate new features. Databricks moves fast, so this should ideally happen every 6-9 months. Please don't leave it for more than a year!
If we had a penny for every story of a team who realised they had to destroy their Databricks workspace to change their VNETs, it wouldn’t be much of a bonus, but it would be enough to share a 10p Freddo bar.
Fixing security is much harder to do retrospectively.
Each of these topics can be a fractal. This blog isn't a replacement for the documentation; rather, it aims to raise awareness of what needs to be considered.
Isolation of data and assets - Isolation has varying levels of severity to it. One IT team might serve two distinct legal entities with different regulations. Imagine a company working in both Swiss markets and Chinese markets; this would potentially require the most extreme levels of isolation, down to separate cloud accounts. A much softer version would be two teams that don't often overlap but occasionally need to collaborate. These teams could share a metastore but use separate workspaces, with only certain catalogs accessible to both.
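To make the softer version concrete, here is a minimal sketch with hypothetical catalog and group names: two teams share a metastore, each keeps a private catalog, and one collaboration catalog is granted to both. It assumes a notebook in a Unity Catalog enabled workspace, where `spark` is predefined.

```python
# Hypothetical setup: a shared collaboration catalog both groups can read.
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG shared_reference TO `team_alpha`")
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG shared_reference TO `team_beta`")

# Each team's private catalog is granted only to its own group.
spark.sql("GRANT ALL PRIVILEGES ON CATALOG team_alpha_workarea TO `team_alpha`")
spark.sql("GRANT ALL PRIVILEGES ON CATALOG team_beta_workarea TO `team_beta`")
```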
Who’s an admin - Databricks has admins for the account, the workspace, the marketplace and the metastores, all of whom can be different people. Similar to admins, there are owners for database items like catalogs, schemas, tables, models etc. Then there are workspace items like jobs, dashboards and notebooks.
Whilst we can tell you that making everyone an admin or owner is a bad idea, so is having so few that holiday season grinds progress to a halt. The sweet spot is down to you to decide.
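One way to soften the holiday-season problem is to assign ownership to a group rather than an individual. A minimal sketch, assuming a group named `platform-admins` already exists (catalog, schema and table names are hypothetical):

```python
# Group ownership means changes aren't blocked when one person is away.
spark.sql("ALTER CATALOG finance OWNER TO `platform-admins`")
spark.sql("ALTER SCHEMA finance.reporting OWNER TO `platform-admins`")
spark.sql("ALTER TABLE finance.reporting.monthly_summary OWNER TO `platform-admins`")
```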
Mapping Users, Groups & Principals to workspace objects - tedious, but necessary. Similar to the above with owners, but broader: who can use which clusters in which ways, plus managing jobs and workflows, alerts, secret scopes, models, dashboards, folders and notebooks …the list goes on.
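This mapping is far less tedious when scripted. Here's a sketch using the Databricks Python SDK to set cluster permissions for a group; the group name and cluster ID are hypothetical, and the same `permissions` API covers jobs, dashboards and other workspace objects.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()  # picks up auth from environment variables or a config profile

# Let the data-engineers group attach to and restart one shared cluster.
w.permissions.set(
    request_object_type="clusters",
    request_object_id="1234-567890-abcde123",  # hypothetical cluster ID
    access_control_list=[
        iam.AccessControlRequest(
            group_name="data-engineers",
            permission_level=iam.PermissionLevel.CAN_RESTART,
        )
    ],
)
```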
Onboarding or offboarding - nothing will dampen end user enthusiasm like a 10 working day wait to get access to a platform. Except perhaps a cumbersome authentication process. At the time of writing, Single Sign-On is required for Unity Catalog enabled workspaces. Either way, SCIM also needs to be set up, although we are looking to simplify this in future with new features. You can find the latest guidance in the documentation here.
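As a small illustration, here's a sketch of provisioning a user with the Databricks Python SDK (the user details are hypothetical). In practice most teams drive this from their identity provider via SCIM rather than ad-hoc scripts, but even a script beats a 10-working-day ticket queue.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth from environment variables or a config profile

# Create the user; group memberships would normally flow from the IdP via SCIM.
user = w.users.create(
    user_name="new.joiner@example.com",
    display_name="New Joiner",
)
print(f"Provisioned {user.user_name} with id {user.id}")
```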
Data Access - Not just at the high level of catalogs and schemas, but down to the data inside the tables themselves, like PII. Some of this is quick to implement with features like masking, but it becomes more complex (and time consuming) with tokenisation strategies.
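Masking is the quick end of that spectrum. A minimal sketch of a Unity Catalog column mask, with hypothetical function, table and group names: members of `pii_readers` see real emails, everyone else sees a redacted value.

```python
# Define the masking function once, then attach it to the column.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.default.hide_email(email STRING)
    RETURNS STRING
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***REDACTED***'
    END
""")
spark.sql("ALTER TABLE main.default.customers ALTER COLUMN email SET MASK main.default.hide_email")
```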
Immediate action: Walk through access journeys, either hypothetically or with users, and map them to the access they'd need. Now for the fun counter-argument: think of the worst-case data breach scenarios and work backwards. People are much better at describing what they don't want to happen than what they do.
Future action: The good news is that there’s not much here except keeping an eye out for new security features that might make people’s lives easier.
This is much harder to quantify as a checklist, but we'd categorise it as "missing the wood for the trees". Here are some examples we've experienced:
The original ask: Migrate three identical pipelines, one for each of three separate teams
The solution: Run one pipeline and grant access to all three teams
The original ask: Migrate to new dashboards …that the end users exported to build their own custom Excel files
The solution: Write the results directly to Excel
The original ask: Make the new pipelines faster with bigger clusters
The solution: Remove 95% of the redundancy by switching to incremental processing (a sketch follows after these examples)
The original ask: Optimize the longest running pipelines to save money
The solution: Optimize the most expensive pipelines to save money
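For the incremental-processing example above, here's a minimal sketch (table and column names hypothetical): rather than recomputing everything on every run, pick up only rows that arrived since the last run and MERGE them into the target Delta table.

```python
from delta.tables import DeltaTable

# High-water mark from the target table; falls back to the epoch on the first run.
last_seen = spark.sql(
    "SELECT coalesce(MAX(ingested_at), timestamp('1970-01-01')) FROM sales_silver"
).first()[0]

# Only process rows newer than the last run.
new_rows = spark.table("sales_bronze").where(f"ingested_at > '{last_seen}'")

# Upsert the new rows instead of rewriting the whole table.
(DeltaTable.forName(spark, "sales_silver").alias("t")
    .merge(new_rows.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```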
Wherever you're reading this blog, please add your own story in the comments section, in the hope that it helps others spot where they might be going wrong.
It would be naive to advise “don’t do this” because it overlooks the unique position we as Databricks find ourselves in. We can ask awkward questions because we’re not beholden to a hierarchy or politics and are reporting to the contract signer who tends to be removed from the detail.
Money.
In order to save money on their migrations, people look to cut corners; but this creates a false economy. “Buy cheap, buy twice” doesn’t just apply to fast fashion, it applies to platform migrations.
A further aspect of this is the budget put into ongoing maintenance. If you wanted your expensive shoes to last longer, you'd take regular care of them. Again, the same is true for any platform, not just Databricks. We know this can be a hard sell when the overall goal might be cost savings, but dedicating a few months a year to improvements is vastly cheaper than a wholesale migration.
If you're having a hard time articulating this number, it's easier to sell with long-run platform longevity estimates.
| Platform life expectancy | 3 years | 6 years |
| --- | --- | --- |
| Migration cost | £1m | £1m |
| Averaged migration cost per year of life | £333k | £167k |
| Maintenance months per year | 0 | 2 |
| Maintenance costs @ £50k per month | £0 | £100k |
| Total annual cost | £333k | £267k |
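If it helps to sanity-check the table, the arithmetic is just migration cost amortised over platform life plus annual maintenance (figures in pounds):

```python
# Annualised cost = migration cost spread over the platform's life,
# plus maintenance months per year at the monthly rate.
def annual_cost(migration_cost, life_years, maint_months_per_year, monthly_maint):
    return migration_cost / life_years + maint_months_per_year * monthly_maint

print(annual_cost(1_000_000, 3, 0, 50_000))  # 333,333 -> ~£333k per year
print(annual_cost(1_000_000, 6, 2, 50_000))  # 266,667 -> ~£267k per year
```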
Migration costs are predominantly made up of people time. Not just the small army of developers, but project managers, change management, enablement and onboarding teams.
The cheapest migration is the one that’s not needed.
Well done for making it to the end of this blog, we appreciate it’s not a light read. Here are summaries of all the actions we recommend you take in the short and long term.
Immediate actions:
Future plans: schedule time to regularly revisit:
We hope this has given you a few ideas for your migrations. Catching things early will increase the odds of success, and hopefully give your career a boost at the same time. If you’ve incorporated a few things into your plans, let us know!
Some of you reading this may be overwhelmed by a sinking feeling as you realise how many gaps there might be in your current plans. First of all, take a deep breath; it's a good thing to learn this now instead of a week before the deadline. Second, help is out there. If you'd like help from a third party who has been through it before, Databricks has many consulting partners as well as its own consulting services.