Databricks Community

holly · ‎09-09-2024

Congratulations! Your team has just signed a contract with Databricks and you’re ready to start your first project.

Maybe.

Statistically, most of you will need to migrate off your previous platform before you can start building something new. Or maybe the ‘new’ thing is the same data project but faster, more reliable or cheaper.

Between us, we have worked at Databricks for a combined 7.5 years and on over 30 customer migrations, ranging from upfront planning to hands-on execution and, unfortunately, remediation work when something has gone wrong.

With so much experience, there are trends that come into sharp focus, along with a spidey sense of what will work and what won’t. Here are our top 6 Migration Mistakes we see people make, what you can do about them, and how to avoid falling into these traps.

#1 Moving everything

Tables

1000 tables in the old system means 1000 tables in the new one, right? Not so fast. There are many you can exclude from this total:

Backup tables. It’s better to make these with Delta cloning rather than migrating two of the same table
Materialised Views & Tables that are overwritten as part of the pipeline. By definition, they’ll be replaced as soon as the pipeline starts again
Staging tables (potentially). Ask whether it would be cheaper to ingest everything in one go from the source systems than to pay for a potentially expensive cross-storage transfer

You may still need to define the 1000 tables, their locations, the columns and data types, but this can be done with CREATE TABLE statements, which are far cheaper and quicker to run than moving the data.
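As a minimal sketch of those two cheaper routes, the helpers below build the SQL for a Delta clone backup and for an empty table definition. The table names and columns are hypothetical, and on Databricks the resulting strings would be run with spark.sql(...) or pasted into a SQL editor.

```python
def clone_backup_sql(source: str, backup: str, deep: bool = True) -> str:
    """Build a Delta CLONE statement that creates `backup` from `source`."""
    kind = "DEEP" if deep else "SHALLOW"
    return f"CREATE TABLE IF NOT EXISTS {backup} {kind} CLONE {source}"


def create_table_sql(name: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement from a column-name -> data-type mapping."""
    cols = ", ".join(f"{col} {dtype}" for col, dtype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {name} ({cols})"


# Hypothetical names, for illustration only.
backup_stmt = clone_backup_sql("sales.orders", "sales.orders_backup")
define_stmt = create_table_sql(
    "sales.orders_staging", {"order_id": "BIGINT", "order_date": "DATE"}
)
print(backup_stmt)
print(define_stmt)
```

Generating the statements from a list of table definitions keeps the 1000 definitions reviewable in one place, without copying any data.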

Pipelines

We have spent a depressing amount of time debugging pipelines only to find that they’re obsolete and no one uses them any more.

This sounds obvious when you write it out, but paying money for things you don’t use has a terrible return on your investment.

Technical Debt

Everyone has technical debt; it’s an inevitable part of any representation of squishy humans and evolving business practices in 1s and 0s. How much is acceptable is a sliding scale, and a migration is a good opportunity to pay it down and start fresh with a new platform.

Credit: Vincent Déniel

When we drafted this blog, we realised we had wildly differing views of what technical debt entailed, and how important it was to resolve, and when. After heated discussions and more research we settled on the following categories and when to resolve them:

1. Business logic and data quality: changing classifications, inconsistent date formats, test data; the list of sources of bad data quality goes on. This is normally ‘fixed’ on an iterative basis, inconsistently, and at different points in the pipeline.

2. Data modelling: perhaps all the data is representative, but getting it into the right shape is a nightmare. Did someone get halfway through a data vault project only to give up? Does your star schema look more like a supernova? Perhaps the final end state is fine, but getting to that point includes many oddly named tables and redundant steps added over time?

3. Architectural: stems from using the wrong tool for the job and having to make subsequent workarounds

4. Governance: often materialises as a lack of control or transparency. Who dropped that table? Was it the same person that spent $5000 in a month? Why do I only discover this six months after it happens?

You are the best judge of your own technical debt, but here are some considerations to help you decide what to do with it. If your technical debt is caused by a limitation of your current system (3 & 4), then make sure time is allocated once you start rewriting your code for the migration.

If your technical debt is 1 & 2 and due to years of neglect, it’s going to be easier to work through it in a system you are familiar with than while you’re trying to learn something new. Do not try to migrate at the same time as you are fixing technical debt: it’s impossible to validate the outcomes if you have nothing to compare them to.

Immediate Action: Obviously, you need to review your estate to see if anything can be decommissioned. But the biggest pushback is that time hasn’t been allocated in this project to do so. So how do you pitch it to the budget holders?

Perform a quick assessment of the areas of your estate. You’re looking to weigh up the potential savings against the cost of determining what’s obsolete. As developers, you should have a rough idea of what’s important, who complains the most when outages occur, and who’s always asking for changes.

Compare this with the migration efforts and the ongoing run costs. These don’t have to be exact numbers, just an indication of the order of magnitude for size and complexity. Here’s an example of how to compare these areas:

Business Area	Project Name	Likelihood of obsolescence	Migration Estimate (hours)	Annual Future Run Cost	Potential Savings	Recommendation
Sales	Alpha	S	250	$$	$	Migrate
Sales	Beta	M	500	$$$$	$$$$$$$	Thorough investigation
Marketing	Gamma	L	20	$	$	Quick meeting with Marketing

In our example, Marketing might have the most obsolete pipelines, but it’s only a handful of tables, so don’t invest much time getting to the bottom of it. The Sales project Beta, however, looks like it’ll be the bulk of the migration and then a hefty part of the bill afterwards, so that warrants investigating what can be decommissioned.
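If you have more than a handful of projects, the same comparison can be scripted. Here is a minimal sketch that encodes the three rows above; the recommend rule is our own hypothetical rule of thumb chosen to reproduce the table, not a formula, and the dollar signs are simply counted as orders of magnitude.

```python
from dataclasses import dataclass


@dataclass
class Project:
    area: str
    name: str
    obsolescence: str     # "S", "M" or "L", as in the table above
    migration_hours: int  # rough estimate, kept for reporting
    run_cost: int         # number of "$" signs: order of magnitude only
    savings: int          # number of "$" signs: order of magnitude only


def recommend(p: Project) -> str:
    # Hypothetical rule of thumb: dig deep only where the potential
    # savings outweigh the future run cost and obsolescence is plausible.
    if p.savings > p.run_cost and p.obsolescence != "S":
        return "Thorough investigation"
    if p.obsolescence == "L":
        return "Quick meeting with " + p.area
    return "Migrate"


projects = [
    Project("Sales", "Alpha", "S", 250, 2, 1),
    Project("Sales", "Beta", "M", 500, 4, 7),
    Project("Marketing", "Gamma", "L", 20, 1, 1),
]
recommendations = {p.name: recommend(p) for p in projects}
for name, action in recommendations.items():
    print(name, "->", action)
```

Adjust the rule to whatever your budget holders care about; the point is to make the trade-off explicit rather than precise.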

Future Action: Set up a regular annual review to audit what you have and see if anything can be decommissioned or descoped. If you’re keeping an old pipeline as a backup to the new one, does it have to be streamed in real time, or could it move to a cheaper weekly batch?

The good news is that this is significantly easier with Unity Catalog and system tables. These contain stats on how frequently tables are used, along with lineage details to generate reports on your most unloved areas.
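As a rough sketch of such a report, the function below flags tables with no recorded read in the last 90 days. The input shape is hypothetical: in practice the last-read dates would come from a query over the Unity Catalog lineage system tables, whose exact schema you should check in your own workspace.

```python
from datetime import date, timedelta
from typing import Optional


def unloved_tables(
    last_read: dict[str, Optional[date]],
    today: date,
    max_idle_days: int = 90,
) -> list[str]:
    """Return tables never read, or not read within `max_idle_days`.

    Tables that were never read come first, then the longest-idle ones.
    """
    cutoff = today - timedelta(days=max_idle_days)
    stale = [
        (read or date.min, name)
        for name, read in last_read.items()
        if read is None or read < cutoff
    ]
    return [name for _, name in sorted(stale)]


# Hypothetical table names and dates, for illustration only.
usage = {
    "sales.alpha": date(2024, 9, 1),
    "sales.beta": date(2024, 1, 15),
    "marketing.gamma": None,
}
candidates = unloved_tables(usage, today=date(2024, 9, 9))
print(candidates)
```

Treat the output as a list of questions to ask, not a list of tables to drop.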

Of course, finding what’s still in use is only going to be possible if you speak to the people using the final data products.

#2 Not speaking to all your user groups.

And we mean all of them.

No one likes surprise project scope or having to report the migration as ‘Red’ because there’s a group of users unaccounted for. This can be particularly egregious if they’re using a tool you’ve not made integration plans for.

This is easier said than done, and the best way to approach this is going to depend heavily on your organization. Here are some mistakes we see being made:

Relying on a single channel for communication. Don’t just rely on emails - make sure it’s brought up in SteerCo, added on to internal sites etc.
Communicating at a single level. Telling a Director might seem like the most appropriate route to go, but not all of them will know all the dependencies a team has
Using language and acronyms that aren’t shared across teams. Holly recently received a deluge of EMU MFA upgrade emails that got promptly ignored …only to find she lost her GitHub access.

There are stories of teams so frustrated with the lack of response that they paused all pipelines to see where the complaints came from. It can be effective at identifying users, but we would never endorse such an extreme approach, and it might prove detrimental to future collaboration efforts.

Immediate Action: Recognise some of these mistakes? Short of hiring a sky writer, think of ways to communicate at multiple levels.

Future Action: Incorporate tracking into the future platform, either through enforced tagging of resources, access management, or community building.

#3 Cut and paste code as is

By definition, you’re using Databricks because it does something different. This means your code will have to change to get the most out of the platform. Like onions, this has layers from the super obvious all the way down to the more nuanced.

Spark-ifying Python and Scala: If your code doesn’t use Apache Spark™, you’re going to end up with a horrendously inefficient, expensive pipeline. Your data frames need to be Spark data frames, created with spark.read, and you’ll know you’re doing it right if this comes up at the bottom of your code

Learning Spark is easily doable; however, it isn’t a trivial activity. It’s not just about writing the code, but also learning how to distribute your code to make it more efficient. If you have some Python experience but have never used Spark before, now might be the time to tap into your development fund.
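To make that concrete, here is a minimal sketch of reading and aggregating data through Spark rather than pulling it onto a single machine. It assumes the spark session that Databricks notebooks provide; the path and the region column are hypothetical.

```python
def orders_per_region(spark, path: str):
    """Read CSV files into a Spark DataFrame and count rows per region.

    `spark` is assumed to be a SparkSession, such as the one Databricks
    notebooks expose. The work is distributed across the cluster and only
    runs when an action (for example .show() or a write) is called.
    """
    df = spark.read.format("csv").option("header", "true").load(path)
    return df.groupBy("region").count()


# In a notebook (hypothetical path):
# orders_per_region(spark, "/Volumes/main/sales/raw/orders/").show()
```

The function never collects the data into local Python objects, which is the habit to build when moving from single-machine data frames.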

If you are using SQL, good news! This is going to be less of an issue as the Spark engine is able to interpret SQL code and apply the benefits of Spark without much editing.

Switching to Delta Lake as your storage format: Delta Lake has phenomenal performance, especially when using Spark. Sticking to legacy formats like Parquet, ORC, or Avro is going to impact your performance …and your budget.

The move to Delta requires less complex code changes. Databricks writes to Delta by default, so removing explicit references to the file format usually works well. MSCK REPAIR TABLE and REFRESH TABLE no longer do anything (tables are refreshed and repaired by default), so these statements need to be removed. Delta may require maintenance with OPTIMIZE and VACUUM, but we’d recommend Predictive Optimization instead so you never have to think about it again.
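If you have a lot of legacy SQL scripts, the repair and refresh statements can be stripped mechanically. This is a deliberately naive sketch, assuming statements are separated by semicolons; the table names are hypothetical, and anything it removes should still be reviewed.

```python
import re

# Statements that are redundant once the tables are Delta tables.
_OBSOLETE = re.compile(
    r"^\s*(MSCK\s+REPAIR\s+TABLE|REFRESH\s+TABLE)\b", re.IGNORECASE
)


def drop_obsolete_statements(sql_script: str) -> str:
    """Remove MSCK REPAIR TABLE and REFRESH TABLE statements from a script.

    Naive on purpose: statements are split on ';', so a semicolon inside a
    string literal or comment would confuse it.
    """
    statements = [s.strip() for s in sql_script.split(";")]
    kept = [s for s in statements if s and not _OBSOLETE.match(s)]
    return ";\n".join(kept) + (";" if kept else "")


legacy_script = """
MSCK REPAIR TABLE sales.orders;
INSERT INTO sales.daily SELECT * FROM sales.orders;
refresh table sales.daily;
"""
cleaned_script = drop_obsolete_statements(legacy_script)
print(cleaned_script)
```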

Remove those legacy Spark settings: In the days of Hadoop, it was common to go through every. single. setting. and manually set them. In Databricks this is often completely unnecessary, and often works against you if you’re using Photon, and is especially egregious when using Serverless. We’re so confident that Serverless will be more efficient, that if it’s worse, please let your account team know so they can pass it on to engineering to get it fixed.

Both of us have experience in optimizing pipelines where the culprit was a rogue setting causing a bottleneck.

That’s not to say they’re never set, but start from a position of defaults and work from there.
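A quick way to start from defaults is to list every setting your existing code sets explicitly, then justify each one. The scanner below is a minimal sketch that looks for spark.conf.set(...) calls and SQL SET statements; the sample code and the patterns it covers are illustrative, so expect to extend them for your own codebase.

```python
import re

# Matches spark.conf.set("spark.x", ...) in Python or Scala,
# and SET spark.x = ... at the start of a line in SQL.
_SETTING = re.compile(
    r"""spark\.conf\.set\(\s*["'](?P<py>spark\.[\w.]+)["']|^\s*SET\s+(?P<sql>spark\.[\w.]+)""",
    re.IGNORECASE | re.MULTILINE,
)


def find_spark_settings(source: str) -> list[str]:
    """Return the Spark configuration keys that `source` sets explicitly."""
    return [m.group("py") or m.group("sql") for m in _SETTING.finditer(source)]


# Hypothetical legacy code, for illustration only.
legacy_code = '''
spark.conf.set("spark.sql.shuffle.partitions", "2000")
df = spark.read.table("sales.orders")
SET spark.sql.adaptive.enabled = false;
'''
found_settings = find_spark_settings(legacy_code)
print(found_settings)
```

Each key it finds is a candidate for deletion unless someone can explain why it is still needed.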

Address technical debt caused by legacy systems: Holly once worked on a codebase where 80% of it was checking what had been loaded so that data wasn’t double counted. Databricks has a suite of tools to handle incremental processing and so could have removed a big chunk of that code base.

Unfortunately, some of this is an unknown when starting off with a new platform. Especially if you take your pipelines as a ‘fact’, or worse, don’t know why they behave in this way in the first place. It can be worth planning out bigger pipelines upfront before coding and then running them past a Databricks expert once you have the initial designs in place. This doesn’t have to be someone at Databricks, it could be someone else in your organization who has been using the platform for a while.

A final warning: Beware the AI generated code converter. They definitely have their place when it comes to repetitive, arduous work or for generating ideas. They’re not at a place where you can trust everything they create. Even the Databricks Assistant (the best AI coding assistant on the market for Databricks) can give overly verbose answers or overlook simplifications that a human would spot.

Immediate actions: Armed with this additional detail, incorporate any high risk areas into timelines. For complex pipelines, map out what changes need to be made before starting to code. If you’re working with large teams, write a guide or checklist on what needs to be changed.

Future actions: New features will always be available in Databricks that can bring even more efficiencies (aka cost savings!). Set regular placeholders in your calendar to go through the release notes and work out where in your stack they could be relevant.

Intermission

It’s at this point we recognise this is a long blog. Seriously long. So long that we’re about to hit the word limit on the community site, and you probably have other things you need to get done. Part 2 can be found here.


6 Migration Mistakes You Don’t Want To Make: Part 1
