cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
srikantdas
Databricks Employee
Databricks Employee

You likely maintain at least two separate copies of your crucial data. One resides in your data lake, serving as the source for pipeline writes, ML model training, and engineer debugging. The other is in a cloud data warehouse, which powers dashboards and is used by analysts for querying.

The task of keeping these two copies synchronised is a significant undertaking. It requires building, monitoring, and maintaining an ETL pipeline, and someone is inevitably paged when schema changes cause a failure. Furthermore, your dashboards will inevitably display outdated information until the next synchronisation cycle completes.

This leads to a more fundamental question than simply selecting the right cloud data warehouse: Do you truly still need a dedicated, separate data warehouse at all?

The hidden cost of the two-system stack

Cloud data warehouses are fast and well-understood. They're also built on an assumption that your data should live in their storage, optimised for their engine. You still run ETL to build your data layers, that doesn't change. But with a separate warehouse, you then run a second operation to load your gold layer into proprietary warehouse storage. Two systems to keep current, two governance planes over the same data, two bills.

The costs that don't appear in the per-query line item:

  • Sync pipeline maintenance: building, monitoring, and repairing the ETL from your lake to the warehouse; rebuilding it when source schemas change
  • Storage duplication: the warehouse holds a copy of data already in your lake - two separate storage bills for the same bytes
  • Governance drift: your data lake has one set of access controls, your warehouse has another; when they diverge, you carry compliance risk until someone reconciles them
  • Freshness lag: until the next sync runs, analysts query data that's behind - and the gap between systems becomes a recurring explanation to the business

These aren't edge cases. They're the operational reality of a two-system analytics stack.

From a separate stack to a Lakehouse-first architecture

Most analytics stacks today are split across two systems. The data lake stores raw and engineered data, pipelines land it, ML models train on it, engineers debug from it. The cloud data warehouse holds a copy in proprietary storage with its own engine from where the BI dashboards point. An ETL/copy step bridges the two, and access controls are maintained on both sides.

srikantdas_0-1779206163736.png

But the workload mix has shifted. BI, ML, and AI now want access to the same governed data, and keeping them on separate platforms keeps adding cost. The lakehouse-first architecture takes a different starting point: one storage layer in open formats, one governance plane, and engines specialized for each workload reading the same tables.

srikantdas_1-1779206163736.png

Benefits of the Open Lakehouse

The Open Lakehouse on Databricks has some key differences vs a typical Dedicated OLAP stack.

No load step between lake and BI - With Databricks SQL, your gold-layer Delta/Iceberg tables are queryable directly by the BI engine. There's no separate load operation to copy data into warehouse storage to unlock full query performance, your notebooks, dashboards, and production ETL all read the same files at the same freshness. The sync between lake and warehouse disappears because there's no separate warehouse storage to sync to.

One governance layer across all workloads - Unity Catalog applies one set of grants, row-level and column-masking policies, lineage tracking, and audit logs across ETL pipelines, BI dashboards, ML notebooks, and AI workloads, all on the same underlying tables. There's no second governance system to configure or keep aligned. The advantage scales with how diverse your workloads are; it matters most when ML and AI workloads are part of the picture. Beyond grants and row/column-masking policies, With Unity Catalog, governance scales effortlessly alongside your expanding data estate. By utilizing Attribute-Based Access Control (ABAC), policies defined once for table and column tags are automatically applied across thousands of objects. This eliminates the need for manual, per-table grant management and prevents administrative overhead as your data environment grows.

AI native in SQL, with an open model roster - Databricks SQL features a number of native AI Functions that you can use to integrate AI within your SQL queries. One of the most popular among users is ai_query(). ai_query() calls any model endpoint from SQL, any OpenAI-compatible API, your own self-hosted model, or a Databricks-served model via FMAPI against the same Delta/Iceberg tables your BI dashboards query. Models registered in Unity Catalog via MLflow are callable from SQL too. In addition, you can run specialized AI functions curated and tuned by Databricks for specific applications such as intelligent document processing, data masking (for PII redaction), entity extraction, and a number of others. There's no second system to feed, no copy-out, no side-pipeline to a separate ML platform. Some cloud warehouses now offer native AI SQL functions, but the model roster is locked to whatever the vendor chooses to host.

High Performance in the Open Lakehouse Architecture

Though the architecture is different, there is still a core execution engine at the heart of the stack that drives query performance. Databricks SQL is the engine powering Databricks for the same tasks cloud data warehouses are built for: high-concurrency business intelligence (BI), queries heavy on aggregation, and large-scale interactive data exploration. It is a purpose-built SQL surface for this architecture, not just a thin SQL layer on top of Spark.

Databricks SQL enables high-concurrency BI and interactive analytics directly on Delta/Iceberg tables in open storage. Crucially, it eliminates the need for a separate data loading step, distribution key planning, and an isolated governance system. It achieves this through several key innovations.

Photon execution engine: Photon is a native vectorized execution engine written in C++, purpose-built for SQL on Delta/Iceberg. Databricks SQL with Photon has continued to deliver year-over-year performance gains across BI, ETL, and ad-hoc workloads (see Databricks' 2025 DBSQL performance review). Customers migrating from legacy cloud data warehouses typically see multi-x reductions in query latency on aggregation- and join-heavy workloads, though results vary by query profile and data layout. For example, PepsiCo built an enterprise-grade Data Intelligence Platform on Databricks - unifying BI, ML, and AI workloads on a single Unity Catalog-governed Lakehouse and replacing fragmented legacy systems (see PepsiCo case study).

Liquid Clustering: No upfront key design. Most cloud data warehouses require you to define distribution keys, sort keys, or cluster keys before you know your query patterns. When patterns change, you rebuild tables. Some vendors have automatic clustering which helps but still requires predefined cluster keys upfront. DBSQL Liquid Clustering takes a different approach: no upfront key definition. The system observes actual query patterns and reorganizes data incrementally in the background. As workloads evolve, the layout adapts without table rebuilds, without schema changes, without downtime.

Predictive Optimisation: Automated compaction, statistics upkeep, and clustering run continuously in the background. Existing queries get faster over time as a byproduct of normal platform operation not as the result of a manual tuning cycle.

Comparing the Fundamentals

The difference in these two architectures as well as their underlying engines and how they interact with other upstream and downstream components leads to different results, which are summarized in the table below.

Side-by-side

Dimension

Dedicated OLAP Stack

Databricks SQL on Open Lakehouse

Data storage

Proprietary internal format; full performance requires loading a copy into the warehouse. Some offer open-table reads but with write/feature restrictions

Open Delta/Iceberg Lake on your object store.Same files for ETL, ML, BI.

Query engine

Vectorized, columnar (varies by vendor)

Photon vectorized C++ engine

Layout tuning

Distribution, sort, or cluster keys; predefined upfront, manual to change

Liquid Clustering - no upfront keys; adapts to actual query patterns; no rebuilds

Governance

Warehouse catalog governs warehouse; Data lake has its own; ML/AI has a third layer

Unity Catalog - Open & spans ETL, SQL, ML, and AI on the same tables

AI in SQL

Native AI SQL functions in some warehouses; model roster locked to what the vendor hosts

ai_query() to any model endpoint, OpenAI-compatible APIs, self-hosted, or Databricks FMAPI. Specialized AI functions for parsing, classification, extraction, translation, PII. MLflow-registered models callable from SQL- all on the same Delta/Iceberg tables

ML / AI Integration

Copy-out to a separate ML platform, or in-DWH ML restricted to vendor-provided algorithms and runtimes. Capped by the warehouse's compute envelope - ML cannot scale independently of BI, so training and inference compete with dashboard concurrency.

MLflow + Model Serving + FMAPI on same Delta/Iceberg tables, no copy

Lake-to-BI load

Required for full performance; open-table read paths exist but with restrictions

Delta/Iceberg queries in place - no load step needed

Data freshness

Refresh cadence of sync pipeline

Direct Delta/Iceberg queries; streaming tables for near-real time.

BI / Dashboarding

Separate cloud service; its own storage, governance, billing

Bring your own BI (Power BI, Tableau, Looker, etc.) on the same governed tables - or use AI/BI Dashboards and Genie natively. No additional license fees, full Unity Catalog governance, and a built-in agentic experience for analysts (natural-language Q&A, conversational drill-down, AI-assisted explanations).

Data Sharing

Restricted to same-vendor consumers

Delta Sharing — open protocol; broad consumer support (Power BI, Tableau, pandas, Spark, partners).

How to decide

Run your actual dashboard queries against DBSQL Serverless before making any platform decision. Databricks publishes TPC-DS benchmark results, and you can benchmark your own workload on real data in hours.

Then evaluate on four dimensions:

  • Latency. What P95 do your slowest dashboard queries require? Sub-second is achievable with proper data layout. DBSQL handles typical BI concurrency (tens to a few hundred simultaneous users) well. Very high-volume embedded analytics at thousands of concurrent users can work, but typically benefits from additional architecture for e.g., right-sizing DBSQL Serverless for sustained QPS, query result caching, materialized views over hot aggregates, and isolating embedded traffic on a dedicated warehouse so it doesn't compete with internal BI.
  • Freshness. DBSQL queries Delta/Iceberg tables at whatever freshness the data was last written - the engine adds no additional lag beyond query execution time. With streaming tables and materialized views, you can achieve sub-minute freshness for continuously arriving data.
  • AI and ML workload mix. Do your data engineers, ML team, and SQL analysts all work on the same underlying data? If yes, the "one governance layer" and "no copy between AI and BI" arguments get stronger. The question is whether having them on separate platforms is creating operational costs that's not being measured.

Signals it's time to re-evaluate

  • You're running a pipeline whose only job is keeping the warehouse in sync with the lake - and it requires ongoing maintenance and breaks regularly
  • Your ML or AI workloads depend on warehouse data, so someone built a side-pipeline to move data between systems
  • You're maintaining two sets of access controls over the same tables, and reconciling them comes up in compliance reviews
  • Query patterns shift faster than your distribution-key or cluster-key design can accommodate, and rebuilding tables has become routine

If two or more of these apply, benchmark your real workload against DBSQL. The data warehouse migration cost is almost always lower than the compounding operational cost of running an architecture that doesn’t scale with your business needs and user demands.

The standalone cloud data warehouse made sense when lake and warehouse served genuinely different purposes. As those purposes converge - and as more workloads want access to the same governed data - the question is whether maintaining the separation is still paying for itself.

Get started

Run your most expensive OLAP query against DBSQL Serverless this week. Databricks Free Edition is free to start. Test real queries on real data, and let the results guide your decision.

<Related Databricks Blogs>