Oleksandra
Databricks Employee

Databricks just released a button to fix all legacy issues in your workspace.

Actually, they didn't. But the fact that you felt a momentary surge of hope—or a cynical eye-roll—proves exactly where you are.

You’re the veteran. You’re the one who remembers times when Delta Lake was just an open source whisper. You’ve built a production environment that is stable, battle-tested, and currently held together by a thousand lines of expert boilerplate that is on your to-do list to rewrite.

We need to have an honest conversation now. As Data Engineers, our mission has never been to "manage clusters" or "write the perfect YAML file." Our mission is to move high-quality data at the speed the business requires. Yet, here we are. Many of us are so buried in the "how" of 2019 that we’ve lost sight of the "why."

The mission remains unchanged, but the Databricks Platform has evolved. To keep delivering on the promise of Data Engineering, we have to stop doing the heavy lifting that the platform now does for us.

Here are 10 things you need to stop doing today to reclaim the time you need to innovate.

 

# Spend hours finding the perfect cluster configuration

We used to obsess over CPU-to-RAM ratios and instance types, such as i3.xlarge vs. r5d.2xlarge. In 2026, Serverless is the default for everything: Jobs, Notebooks, and even GPUs. It scales at the task level and also has a much more aggressive autoscaling mechanism (compared to classic compute), meaning "over-provisioning just to be safe" is a dead concept.

Things to keep in mind:

  • "Idle Tax" vs.  Execution Premium: When evaluating TCO, don't get distracted by the higher per-DBU price tag of Serverless. In Classic compute, you are paying for the entire lifecycle of the VM—including the 5–7 minutes of "cold start" and the inevitable "idle tail" where a cluster stays up in case another job starts. Don’t forget about the hidden costs of “just in case” over-provisioning and the overhead of managing VM pools. By charging only for the exact seconds of query execution and utilizing aggressive autoscaling, Serverless often results in significant cost savings. Perhaps more importantly, it frees up DE time, a TCO factor that is rarely included in overall calculations.
  • Monitoring: The old Spark UI is gone, replaced by a consolidated monitoring page. Hunting for bottlenecks works quite differently from the old workflow, so reserve some time to adjust.
  • Versionless: No more pinning DBR versions; the platform manages engine updates automatically. Sounds good, but it also requires broader test coverage and extra consideration in strictly regulated environments.
  • Locked Configs: You can't inject 50 lines of spark.conf anymore. If your code needs obscure tuning to survive, it needs a refactor. It’s also a good time to ask yourself if you really need these configurations. With the release of Spark 3 and now Spark 4, many state-of-the-art Spark parameter tuning features have become part of the Spark core.
  • Environments instead of Init Scripts: Serverless uses Environments for libraries. It’s faster and more secure, but requires moving away from legacy init scripts.
  • Niche limitations: Check the current list of serverless limitations in the documentation to catch less common scenarios that might affect you.

If you aren't ready for full Serverless due to complex dependencies, Classic compute has also gained some features worth enabling:

  • Flexible Node Types: If your primary instance is out of stock, the cluster automatically falls back to a compatible type, rather than failing.
  • Automatic cluster restart: Databricks now automatically restarts long-running clusters during a scheduled maintenance window to apply the latest OS-level security images and patches, which also keeps them from degrading over time.

The Migration Path

  1. Audit your code: Identify legacy mounts, init scripts, and custom Spark configs that will break on Serverless (see the sketch after this list).
  2. Estimate TCO: Compare the cost of Serverless for your specific workloads. Serverless generally provides better TCO, but don’t let the benefits blind you to the costs.
  3. Usage governance: Implement Budget Policies and tags immediately to prevent accidental "infinite scaling" costs.
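To make the audit in step 1 concrete, here is a minimal sketch that scans a repository for common Serverless blockers. The patterns and the repo path are illustrative assumptions; extend them to match your own codebase.

```python
import re
from pathlib import Path

# Patterns that typically break (or are silently ignored) on Serverless compute.
# This list is illustrative, not exhaustive.
LEGACY_PATTERNS = {
    "dbfs_mount": re.compile(r"dbutils\.fs\.mount|/mnt/"),
    "spark_conf": re.compile(r"spark\.conf\.set\("),
    "init_script": re.compile(r"init_scripts"),
}

def audit_repo(repo_root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, finding) tuples for every legacy hit."""
    findings = []
    for path in Path(repo_root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            for name, pattern in LEGACY_PATTERNS.items():
                if pattern.search(line):
                    findings.append((str(path), lineno, name))
    return findings

if __name__ == "__main__":
    for file, lineno, finding in audit_repo("."):
        print(f"{file}:{lineno}: {finding}")
```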

# Rely on DBFS mounts for data access

What started as a good idea of giving users easy access to file storage quickly led to a maintenance disaster and a governance loophole. Add to that the elevated permissions required to set it up and the lack of workspace isolation. It’s no surprise that new accounts don’t support mounts anymore, in favour of Unity Catalog Volumes and External Locations.

UC Volumes provide a path-based interface to non-tabular data (think PDFs, images, CSVs). Unlike mounts, Volumes are governed by the same GRANT/REVOKE logic as your tables.
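Here is roughly what that looks like in practice; the catalog, schema, volume, and group names are hypothetical, and the snippet assumes a Databricks notebook where `spark` is available.

```python
# Create a managed volume and grant read access to a group (names are hypothetical).
spark.sql("CREATE VOLUME IF NOT EXISTS main.default.landing_docs")
spark.sql("GRANT READ VOLUME ON VOLUME main.default.landing_docs TO `data_engineers`")

# Files are then addressed by path, just like with mounts, but governed by UC.
df = (spark.read
      .option("header", "true")
      .csv("/Volumes/main/default/landing_docs/2026/"))
df.display()
```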

Things to keep in mind:

  • Managed vs. External: You must choose between managed and external volumes—a debate similar to managed vs. external tables. UC manages layout and deletion for managed volumes. If you choose external volumes, remember that access is governed inside UC but not for tools that hit the underlying storage directly; you need extra measures to prevent unauthorised access outside the Databricks ecosystem.
  • Programmatic access: If you want to create volumes programmatically, you need to use a SQL command (or the UI), since you can’t create or delete the /<catalog>/<schema>/<volume> part of the path with filesystem commands.

The Migration Path

  1. Check your code: Migrate /mnt/data to /Volumes/main/default/my_volume/ in your code base. Don’t forget to set proper permissions on the Volumes. Since mounts were available to anyone in the workspace, moving to fine-grained permissions might break some pipelines, but it will definitely improve your governance.
  2. Clean up: Don’t forget to clean up unused mounts and cluster configurations.

# Build an ingestion pipeline for SQL databases just to query data

Every time a stakeholder requests a report that involves data from a relational database, we default to building an ingestion pipeline (including writing the JDBC connection, handling the partition logic, scheduling the job, and storing a duplicate copy of the data in Delta so that you can join it with your existing bronze tables).

Although ingestion is sometimes necessary, you can also leverage Lakehouse Federation for ad hoc queries against relational databases. Since the source is “mounted” to UC as a foreign catalog, it respects all UC permissions, as if it were yet another Delta table. Besides easy management, it also gives you a single place to oversee all your connections, which improves security and manageability (no more hardcoded values or scattered secrets).
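Setting it up is a couple of SQL statements. A minimal sketch with hypothetical connection, host, and catalog names; credentials come from a secret scope rather than being hardcoded.

```python
# Register a connection to an external Postgres instance (names and host are hypothetical).
spark.sql("""
  CREATE CONNECTION pg_sales TYPE postgresql
  OPTIONS (
    host 'sales-db.example.internal',
    port '5432',
    user secret('jdbc_scope', 'pg_user'),
    password secret('jdbc_scope', 'pg_password')
  )
""")

# Expose a database from that connection as a foreign catalog in Unity Catalog.
spark.sql("""
  CREATE FOREIGN CATALOG sales_pg
  USING CONNECTION pg_sales
  OPTIONS (database 'sales')
""")

# Query it like any other catalog; UC permissions apply as usual.
spark.sql("SELECT * FROM sales_pg.public.orders LIMIT 10").display()
```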

Things to keep in mind:

  • Double compute: Lakehouse Federation pushes as much of the query as possible down to the remote system, but the final processing still happens in Databricks, so you pay for compute on both sides. If you expect this data to be queried frequently, consider using Materialized Views or ingestion.
  • Lakehouse Federation vs. Catalog Federation: Catalog Federation lets you avoid the double compute, but the list of supported sources is limited at the time of writing.
  • Performance: Although it offers additional performance improvements, it remains a wrapper on top of JDBC. You can’t specify which version of the driver to use, but you can still configure parameters such as the number of partitions.
  • Monitoring: Since it stitches together two systems, it can be challenging to pinpoint the exact location of the performance bottleneck. You can use EXPLAIN FORMATTED to verify that your join is being pushed down; however, you will still need to inspect both the database logs and Databricks Query Profile.

The Migration Path

  1. Before you begin: check how the SQL database data you ingest is actually used. Consider the following questions:
    1. How often do users query this data? Are the queries repetitive or quite diverse? If you need to query the same table over and over again, consider ingesting it.
    2. Is it read-only, or does your code write results back? Lakehouse Federation doesn’t support writes.
    3. What are the latency requirements? Federation always adds overhead that you need to take into account; consider Lakebase or ingestion as alternatives for low-latency flows.
  2. Pay attention to sources with case-sensitive identifiers. Although we all know that having “TaBle” and “table” in the same database is a bad practice, it still happens. Since UC is case-insensitive, you may encounter unexpected issues.
  3. Check database dialect: If you have a lot of queries written in a specific database dialect, check out the remote_query table-valued function. It allows running SQL queries directly against external databases using the native SQL syntax of the remote system, while still relying on the configured connection, which improves governance by keeping all your connections in one place.

# Maintaining a Private Library of Custom Connectors

We all have a git repository of custom-built ingestion wrappers. But if you have a "standard" Python class for fetching data from the Salesforce API, a Scala utility for MongoDB, and a complex set of scripts to handle OAuth2 handshakes for various SaaS platforms, now it’s time to revisit some of them and replace them with managed OOTB alternatives. Lakeflow Connect offers a variety of connectors to ingest from local files, popular enterprise applications, databases, cloud storage, message buses, and the list keeps evolving. What you get with it is governance using Unity Catalog, orchestration using Lakeflow Jobs, and holistic monitoring across your pipelines.

Things to keep in mind:

  • Connectors are a product, not a library: Lakeflow Connect connectors are not a library you can download, but rather a part of Lakeflow functionality that offers different levels of customization and flexibility. You can use these connectors with ingestion pipelines (think SDP provisioned for you), custom SDP, and structured streaming, with configuration options that vary per connector.
  • Standard vs Managed connectors: Standard connectors typically allow more customization options than Managed ones. Confusingly, some sources have both. The rule of thumb is to start with the managed connector and only evaluate the standard one if you need more control.

The "Kill Your Darlings" Checklist:

  • Standard Databases: MySQL, SQL Server, PostgreSQL
  • SaaS Platforms: Confluence, Google Analytics, NetSuite, Salesforce, ServiceNow, SharePoint, Workday, Microsoft Dynamics 365, Meta Ads
  • Cloud Storage and formats: S3, ADLS, GCS, SFTP, Kafka, Kinesis

The Migration Path

  1. Assess your current state: An effective action plan heavily depends on the current state of your environment. If you’re already using SDP and Lakeflow Jobs, the next step is to investigate the specifications and limitations of the existing connectors. Depending on how your code base is structured, you might need to refactor separate tasks or whole pipelines. You don’t have to know everything about SDP to utilize connectors, but being familiar with the UI will help. If you’re using an external scheduler and not relying on Databricks capabilities, the migration curve might be too steep and require a solid strategy.
  2. Decouple Ingestion from Logic: Lakeflow Connect is built for high-performance raw ingestion. It isn't a transformation engine - and it shouldn't be. It allows for keeping ingestion clean and lightweight. However, when migrating, you may need to perform additional refactoring to separate ingestion from logic.
  3. Understand your history and refresh requirements: Although Lakeflow Connect supports SCD Type 1 and 2 (depending on the connector), it’s essential to understand how it aligns with your specific needs.
  4. Pay attention to wide tables: Lakeflow Connect relies on serverless for pipeline execution and is quite accurate with the compute size. However, pay closer attention if you’re ingesting very wide tables, as the pipeline might restart with larger resources. Try to exclude columns you don’t need for better performance and costs.

# Provisioning yet another small SQL database for custom apps

Your company builds a small internal tool—maybe a data quality dashboard, a metadata-driven orchestrator, or a retail inventory tracker. To support it, you provision a “small” database to store the underlying data. Next, you need a pipeline to move data from your lakehouse to this database and back. Before you know it, you’re managing 15 isolated SQL instances with twice as many ELT pipelines moving data back and forth, each with its own secrets, backups, and users. In 2026, the architectural boundary between analytical data and operational databases has dissolved. With the introduction of Lakebase, Databricks has optimized the Lakehouse to handle the high-concurrency, low-latency workloads typically reserved for traditional SQL databases.
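Because Lakebase speaks Postgres, existing application code can usually keep its standard Postgres driver. A minimal sketch with psycopg2; the host, database, user, and credential handling below are hypothetical placeholders.

```python
# Connect to a Lakebase instance with a standard Postgres driver.
# Host, database name, user, and credentials below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="<your-lakebase-instance-host>",
    port=5432,
    dbname="<your-database>",
    user="<your-identity>",
    password="<oauth-token-or-password>",
    sslmode="require",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT status, count(*) FROM inventory GROUP BY status")
    for status, cnt in cur.fetchall():
        print(status, cnt)
```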

Things to keep in mind:

  • Automatic sync pipeline: The biggest timesaver is the ability to set up managed sync pipelines between your Lakehouse and Lakebase in both directions. Currently, the two directions use different mechanisms:
    • Registering Lakebase tables in UC relies on Lakehouse Federation (keep an eye on the release page for new approaches) 
    • Sync from Lakehouse to Lakebase uses Lakeflow Spark Declarative Pipelines to continuously update both the Unity Catalog synced table and the Postgres table with changes from the source table. You can choose between triggered and continuous mode for “real-time” sync and snapshot mode for a full refresh. For Triggered or Continuous sync modes, the source table must also have change data feed enabled.
  • Maturity curve: At the time of writing, the Lakebase product is still under heavy development, so some confusion may occur when comparing Provisioned vs. Autoscaling mode.

The Migration Path

  1. Assess your SQL database: Lakebase supports Postgres as a database dialect. If your applications are built on other dialects and heavily rely on dialect-specific features, you’ll need to evaluate a “your database to Postgres” migration plan.
  2. Sync requirements check: In many cases, you don’t need to sync data in near real time; choose the proper sync mode that balances requirements and costs.
  3. Consider autoscaling: It is one of the biggest innovations of Lakebase; leverage it to save costs.
  4. Evaluate SLAs: Properly assess current limitations for mission-critical applications and ensure that your SLA and DR/HA requirements are fully met.

# Build ETL pipelines to monitor… ETL pipelines

Let’s be honest, we all have it: a dedicated folder with “audit” notebooks, a list of “proven” regex to parse logs, and a few ETL jobs to scrape metrics for the production pipelines. There are numerous questions you need to be able to answer about production, and you must have the necessary data ready. It’s obvious that maintaining these meta-pipelines is an overhead, but there is no other option, right?

Not quite. In 2026, Databricks exposed its internal telemetry directly to you via System Tables. Located in the system catalog, these are fresh, managed tables that track every event, cost, and access across your entire account.

The "Big Three" Tables You Should Use Today:

  • system.billing.usage: The biggest source of information about costs and consumption. By the way, the pre-built usage dashboards use this table to visualise your spending (see the example query after this list).
  • system.compute.clusters and system.compute.node_types: Use these to audit your infrastructure. You can instantly see which clusters are still using legacy instance types or which ones are running for 24 hours without processing a single row.
  • system.lakeflow.jobs: Captures information about the jobs in your account to monitor job configuration and state.
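For example, a quick look at the last 30 days of consumption per SKU is a few lines of SQL (the columns referenced below exist in the current system.billing.usage schema, but double-check them in your workspace):

```python
# Summarise the last 30 days of DBU consumption per SKU from the billing system table.
usage = spark.sql("""
  SELECT usage_date,
         sku_name,
         SUM(usage_quantity) AS dbus
  FROM system.billing.usage
  WHERE usage_date >= date_sub(current_date(), 30)
  GROUP BY usage_date, sku_name
  ORDER BY usage_date, dbus DESC
""")
usage.display()
```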

Things to keep in mind:

  • No real-time support: System tables are not (yet) refreshed in real-time; don’t rely on them for incident alerting. If you need a rapid response to errors or a failing pipeline, it should be part of your monitoring and observability solution, while system tables are used for analytical and reporting purposes.
  • system.information_schema: It behaves differently from other system tables: it only shows you the objects you have access to in UC.
  • Retention period: Most system tables retain data for 365 days. If you want to keep this data longer (for instance, for audit purposes), copy it to your own cloud storage (see the sketch after this list).
  • Potential inconsistencies: There may be inconsistencies in column naming conventions between system tables for different products; please bear with us while we undergo schema evolution.
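A minimal sketch of such an archive job, keyed on usage_date; the target catalog and schema are hypothetical, and a production job would also handle late-arriving rows for the watermark date.

```python
# One-time setup: create an empty archive table with the same schema (hypothetical target).
spark.sql("""
  CREATE TABLE IF NOT EXISTS archive.system_backup.billing_usage
  AS SELECT * FROM system.billing.usage WHERE 1 = 0
""")

# Scheduled job: append only rows newer than what has already been archived.
spark.sql("""
  INSERT INTO archive.system_backup.billing_usage
  SELECT u.*
  FROM system.billing.usage AS u
  WHERE u.usage_date > (SELECT coalesce(max(usage_date), DATE '1970-01-01')
                        FROM archive.system_backup.billing_usage)
""")
```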

The migration plan:

  1. Evaluate your monitoring landscape: assess your analytical needs and replace references to custom tables with corresponding system tables. If you can’t find an appropriate table yet, keep an eye on the release notes, as it may appear later. The same applies to the new products.
  2. Query refactoring: Since system tables are just Delta tables under the hood, the main part of the migration will be writing queries to extract the correct information.

# Have a custom pipeline to run vacuum and optimize

When VACUUM and OPTIMIZE were introduced, they were a game-changer for the maintenance tax. However, they raised an existential question about when and how often to run these operations. Predictive Optimization makes this discussion obsolete. You no longer need a custom pipeline to run these operations, or guesswork to identify the tables that will benefit from them. Predictive Optimization is natively aware of existing snapshots and concurrent writers, significantly reducing the ConcurrentAppendException or FileReadException errors common in custom scripts.
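Enabling it is a one-liner at the catalog or schema level, and a system table shows what it has been doing; the catalog, schema, and table names below are hypothetical.

```python
# Set a sane retention before enabling, so table history isn't vacuumed away unexpectedly.
spark.sql("""
  ALTER TABLE main.sales.orders
  SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')
""")

# Enable predictive optimization for every managed table in a schema.
spark.sql("ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION")

# Audit which maintenance operations ran and on which tables.
spark.sql("SELECT * FROM system.storage.predictive_optimization_operations LIMIT 20").display()
```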

Things to keep in mind:

  • Serverless compute: Predictive optimization requires serverless compute to run. You don’t need to worry about when it should be scheduled, but you still have costs for these runs. The costs show up as a specific "System Component" in your billing.
  • Monitoring: Use system.storage.predictive_optimization_operations to gain visibility into when the operation ran.

The migration plan:

  1. Check if your tables are Managed or External: Predictive optimization can be enabled only for managed tables.
  2. Check if tables use Z-order and liquid clustering, and pay attention to the way Predictive optimization approaches these tables.
  3. Set data retention for vacuum: Before enabling predictive optimization for an account/catalog or schema, ensure you have the proper settings for delta.deletedFileRetentionDuration to prevent the removal of history from the tables.
  4. Clean up: Don’t forget to delete the existing maintenance pipeline to remove double costs.

# Still rely on Hive-style partitioning

Proper partitioning of a large dataset that is used by multiple stakeholders in various scenarios is an art, not a science. Debates over whether you should partition a table by country or by date dominated many DE meetings. However, the worst aspect of partitioning is the need to repartition or adjust it when the read or write patterns change. Over-partitioning leads to many small files. Partitioning that doesn’t match writing patterns leads to skew, which in turn leads to slower queries and frustrated stakeholders.

In 2026, Liquid Clustering has effectively deprecated Hive-style partitioning for high-scale Delta tables. Instead of forcing data into a fixed, hierarchical folder structure, Liquid Clustering uses a flexible, multi-dimensional data layout.
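Declaring it is a table-level clause rather than a folder layout; a minimal sketch with hypothetical table and column names:

```python
# New table: declare clustering keys instead of partition columns.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.sales.events (
    event_date DATE,
    country    STRING,
    user_id    BIGINT,
    payload    STRING
  )
  CLUSTER BY (event_date, country)
""")

# Existing liquid table: change the keys later without rewriting the data up front.
spark.sql("ALTER TABLE main.sales.events CLUSTER BY (event_date, user_id)")

# Clustering is applied incrementally when OPTIMIZE runs (or via predictive optimization).
spark.sql("OPTIMIZE main.sales.events")
```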

Things to keep in mind:

  • Liquid clustering vs. Partitioning vs. Z-order: Liquid clustering is an alternative to both partitioning and Z-order. You can’t use the combination and need to choose one.
  • Delta protocol versions. Delta tables with liquid clustering enabled use the Delta Writer version 7 and the Delta Reader version 3. Delta clients that don't support these protocols cannot read these tables. You cannot downgrade table protocol versions. It’s not a problem for new applications, but it can lead to nasty issues for the existing readers and writers.
  • Liquid clustering is incremental, but it requires scheduling: you must trigger it by calling the optimize command. In this sense, it’s different from partitioning, which is part of your write process.
  • Reclustering: You don’t need to rewrite your data with liquid clustering, but full reclustering on a large table might still take hours. Still, in most cases it’s cheaper than repartitioning the whole table.

Migration plan:

  1. Audit your most expensive tables: specifically those with deep folder hierarchies. Re-create them using Liquid Clustering.
  2. Check the Delta writer and reader versions: Since Liquid Clustering requires specific protocol versions, check that all readers and writers of the table support them.
  3. Select clustering keys: and don’t forget to schedule an OPTIMIZE job or enable predictive optimisation.

# Checking _SUCCESS file to trigger your task

Orchestrating data pipelines sometimes feels like a game of "waiting for a signal." There are multiple approaches to determining whether processing of an upstream table is complete. Long before the invention of table formats like Delta and Iceberg, one of the proven methods was to wait for a _SUCCESS or _committed file to appear and start your processing after that. Times have changed and we work with tables now, but there was still no easy way to know that your table had been updated. Until now.

Table update triggers start a Lakeflow job as soon as the selected table(s) are updated.
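Configuration-wise, it is just a trigger block on the job. Below is a rough sketch via the Jobs REST API; the exact payload shape and field names are assumptions on my part, so verify them against the Jobs API reference, or simply configure the trigger in the UI or your asset bundle.

```python
# Sketch: create a job that runs whenever a Unity Catalog table is updated.
# The trigger payload below is an assumed shape -- verify against the Jobs API docs.
import requests

host = "https://<your-workspace-host>"
token = "<your-token>"

job_spec = {
    "name": "refresh_gold_orders",
    "trigger": {
        "pause_status": "UNPAUSED",
        "table_update": {                       # assumed field name
            "table_names": ["main.silver.orders"],
        },
    },
    "tasks": [{
        "task_key": "refresh",
        "notebook_task": {"notebook_path": "/Repos/team/refresh_gold_orders"},
    }],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())
```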

Things to keep in mind:

  • Table trigger and Auto Loader: This feature is essentially Auto Loader for tables and uses a very similar underlying mechanism, meaning you save the cost of keeping a cluster running just to poll for updates, but you still pay a fraction of the costs to the cloud storage provider.

The migration path:

  1. Assess your current state: Similar to the Lakeflow Connect chapter above, the migration path heavily depends on your current level of Lakeflow adoption as a scheduler. If your jobs are already Lakeflow jobs, then migration is reasonably straightforward and requires some cloud setup and code cleanup.

# Ignoring built-in AI functions

If you’re still writing custom Python UDFs with intricate regex patterns to parse complex semi-structured data or documents in 2026, you either ignore the whole AI hype or have enough scars to be sceptical about AI. Of course, you might have other reasons, but there are many built-in AI functions that can take mundane tasks off your shoulders. You can even use them directly within SQL.
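A minimal sketch of what that looks like; the table and column names are hypothetical, while the ai_* functions shown are part of the built-in task-specific set.

```python
# Replace a pile of regex UDFs with built-in AI functions (table and columns are hypothetical).
df = spark.sql("""
  SELECT
    ticket_id,
    ai_analyze_sentiment(body)                            AS sentiment,
    ai_extract(body, array('customer_name', 'order_id'))  AS extracted,
    ai_fix_grammar(body)                                   AS cleaned_body
  FROM support.bronze.tickets
  LIMIT 100
""")
df.display()
```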

Things to keep in mind:

  • Task-specific functions: The library of native functions—like ai_analyze_sentiment, ai_fix_grammar, and ai_extract—is extensive. They allow you to replace hundreds of lines of brittle Python with a single SQL declaration. It’s convenient, but don't let that convenience blind you to the underlying complexity.
  • Performance considerations: Since they are SQL functions, you can use them anywhere you run SQL, but pay attention to performance on your specific datasets or documents. Before stripping out an extremely optimized custom UDF, benchmark the execution time on a representative sample of your documents.

The migration path:

  1. Low-hanging fruit: Start by evaluating where AI functions make sense. Maintaining code to parse fairly standard PDF invoices? That’s a good candidate for applying AI functions.
  2. Evaluate quality: Stay critical about output quality, especially for custom models and task-specific AI functions; that, however, is a broader discussion about AI in general.
  3. Costs: Understand the costs associated with running AI functions on large datasets. If you are running these functions on a 100-million-row Bronze table daily, the "simplicity" could come with a massive price tag.
  4. Consider privacy, regulatory requirements, and regional availability as part of your evaluation.

As Data Engineers, our mission is to build the pipelines that power the future, not to act as permanent life support for the past. Every manual habit you kill today is an hour reclaimed for the architectural work that actually moves the needle. And the great news is that the platform is ready to carry the load; you just need to use it to its full potential.

 
