You likely maintain at least two separate copies of your crucial data. One resides in your data lake, serving as the source for pipeline writes, ML model training, and engineer debugging. The other is in a cloud data warehouse, which powers dashboards and is used by analysts for querying.
The task of keeping these two copies synchronised is a significant undertaking. It requires building, monitoring, and maintaining an ETL pipeline, and someone is inevitably paged when schema changes cause a failure. Furthermore, your dashboards will inevitably display outdated information until the next synchronisation cycle completes.
This leads to a more fundamental question than simply selecting the right cloud data warehouse: Do you truly still need a dedicated, separate data warehouse at all?
Cloud data warehouses are fast and well-understood. They're also built on an assumption that your data should live in their storage, optimised for their engine. You still run ETL to build your data layers, that doesn't change. But with a separate warehouse, you then run a second operation to load your gold layer into proprietary warehouse storage. Two systems to keep current, two governance planes over the same data, two bills.
The costs that don't appear in the per-query line item:
These aren't edge cases. They're the operational reality of a two-system analytics stack.
Most analytics stacks today are split across two systems. The data lake stores raw and engineered data, pipelines land it, ML models train on it, engineers debug from it. The cloud data warehouse holds a copy in proprietary storage with its own engine from where the BI dashboards point. An ETL/copy step bridges the two, and access controls are maintained on both sides.
But the workload mix has shifted. BI, ML, and AI now want access to the same governed data, and keeping them on separate platforms keeps adding cost. The lakehouse-first architecture takes a different starting point: one storage layer in open formats, one governance plane, and engines specialized for each workload reading the same tables.
The Open Lakehouse on Databricks has some key differences vs a typical Dedicated OLAP stack.
No load step between lake and BI - With Databricks SQL, your gold-layer Delta/Iceberg tables are queryable directly by the BI engine. There's no separate load operation to copy data into warehouse storage to unlock full query performance, your notebooks, dashboards, and production ETL all read the same files at the same freshness. The sync between lake and warehouse disappears because there's no separate warehouse storage to sync to.
One governance layer across all workloads - Unity Catalog applies one set of grants, row-level and column-masking policies, lineage tracking, and audit logs across ETL pipelines, BI dashboards, ML notebooks, and AI workloads, all on the same underlying tables. There's no second governance system to configure or keep aligned. The advantage scales with how diverse your workloads are; it matters most when ML and AI workloads are part of the picture. Beyond grants and row/column-masking policies, With Unity Catalog, governance scales effortlessly alongside your expanding data estate. By utilizing Attribute-Based Access Control (ABAC), policies defined once for table and column tags are automatically applied across thousands of objects. This eliminates the need for manual, per-table grant management and prevents administrative overhead as your data environment grows.
AI native in SQL, with an open model roster - Databricks SQL features a number of native AI Functions that you can use to integrate AI within your SQL queries. One of the most popular among users is ai_query(). ai_query() calls any model endpoint from SQL, any OpenAI-compatible API, your own self-hosted model, or a Databricks-served model via FMAPI against the same Delta/Iceberg tables your BI dashboards query. Models registered in Unity Catalog via MLflow are callable from SQL too. In addition, you can run specialized AI functions curated and tuned by Databricks for specific applications such as intelligent document processing, data masking (for PII redaction), entity extraction, and a number of others. There's no second system to feed, no copy-out, no side-pipeline to a separate ML platform. Some cloud warehouses now offer native AI SQL functions, but the model roster is locked to whatever the vendor chooses to host.
Though the architecture is different, there is still a core execution engine at the heart of the stack that drives query performance. Databricks SQL is the engine powering Databricks for the same tasks cloud data warehouses are built for: high-concurrency business intelligence (BI), queries heavy on aggregation, and large-scale interactive data exploration. It is a purpose-built SQL surface for this architecture, not just a thin SQL layer on top of Spark.
Databricks SQL enables high-concurrency BI and interactive analytics directly on Delta/Iceberg tables in open storage. Crucially, it eliminates the need for a separate data loading step, distribution key planning, and an isolated governance system. It achieves this through several key innovations.
Photon execution engine: Photon is a native vectorized execution engine written in C++, purpose-built for SQL on Delta/Iceberg. Databricks SQL with Photon has continued to deliver year-over-year performance gains across BI, ETL, and ad-hoc workloads (see Databricks' 2025 DBSQL performance review). Customers migrating from legacy cloud data warehouses typically see multi-x reductions in query latency on aggregation- and join-heavy workloads, though results vary by query profile and data layout. For example, PepsiCo built an enterprise-grade Data Intelligence Platform on Databricks - unifying BI, ML, and AI workloads on a single Unity Catalog-governed Lakehouse and replacing fragmented legacy systems (see PepsiCo case study).
Liquid Clustering: No upfront key design. Most cloud data warehouses require you to define distribution keys, sort keys, or cluster keys before you know your query patterns. When patterns change, you rebuild tables. Some vendors have automatic clustering which helps but still requires predefined cluster keys upfront. DBSQL Liquid Clustering takes a different approach: no upfront key definition. The system observes actual query patterns and reorganizes data incrementally in the background. As workloads evolve, the layout adapts without table rebuilds, without schema changes, without downtime.
Predictive Optimisation: Automated compaction, statistics upkeep, and clustering run continuously in the background. Existing queries get faster over time as a byproduct of normal platform operation not as the result of a manual tuning cycle.
The difference in these two architectures as well as their underlying engines and how they interact with other upstream and downstream components leads to different results, which are summarized in the table below.
|
Dimension |
Dedicated OLAP Stack |
Databricks SQL on Open Lakehouse |
|
Data storage |
Proprietary internal format; full performance requires loading a copy into the warehouse. Some offer open-table reads but with write/feature restrictions |
Open Delta/Iceberg Lake on your object store.Same files for ETL, ML, BI. |
|
Query engine |
Vectorized, columnar (varies by vendor) |
Photon vectorized C++ engine |
|
Layout tuning |
Distribution, sort, or cluster keys; predefined upfront, manual to change |
Liquid Clustering - no upfront keys; adapts to actual query patterns; no rebuilds |
|
Governance |
Warehouse catalog governs warehouse; Data lake has its own; ML/AI has a third layer |
Unity Catalog - Open & spans ETL, SQL, ML, and AI on the same tables |
|
AI in SQL |
Native AI SQL functions in some warehouses; model roster locked to what the vendor hosts |
ai_query() to any model endpoint, OpenAI-compatible APIs, self-hosted, or Databricks FMAPI. Specialized AI functions for parsing, classification, extraction, translation, PII. MLflow-registered models callable from SQL- all on the same Delta/Iceberg tables |
|
ML / AI Integration |
Copy-out to a separate ML platform, or in-DWH ML restricted to vendor-provided algorithms and runtimes. Capped by the warehouse's compute envelope - ML cannot scale independently of BI, so training and inference compete with dashboard concurrency. |
MLflow + Model Serving + FMAPI on same Delta/Iceberg tables, no copy |
|
Lake-to-BI load |
Required for full performance; open-table read paths exist but with restrictions |
Delta/Iceberg queries in place - no load step needed |
|
Data freshness |
Refresh cadence of sync pipeline |
Direct Delta/Iceberg queries; streaming tables for near-real time. |
|
BI / Dashboarding |
Separate cloud service; its own storage, governance, billing |
Bring your own BI (Power BI, Tableau, Looker, etc.) on the same governed tables - or use AI/BI Dashboards and Genie natively. No additional license fees, full Unity Catalog governance, and a built-in agentic experience for analysts (natural-language Q&A, conversational drill-down, AI-assisted explanations). |
|
Data Sharing |
Restricted to same-vendor consumers |
Delta Sharing — open protocol; broad consumer support (Power BI, Tableau, pandas, Spark, partners). |
Run your actual dashboard queries against DBSQL Serverless before making any platform decision. Databricks publishes TPC-DS benchmark results, and you can benchmark your own workload on real data in hours.
Then evaluate on four dimensions:
If two or more of these apply, benchmark your real workload against DBSQL. The data warehouse migration cost is almost always lower than the compounding operational cost of running an architecture that doesn’t scale with your business needs and user demands.
The standalone cloud data warehouse made sense when lake and warehouse served genuinely different purposes. As those purposes converge - and as more workloads want access to the same governed data - the question is whether maintaining the separation is still paying for itself.
Run your most expensive OLAP query against DBSQL Serverless this week. Databricks Free Edition is free to start. Test real queries on real data, and let the results guide your decision.
<Related Databricks Blogs>
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.