cancel
Showing results forĀ 
Search instead forĀ 
Did you mean:Ā 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forĀ 
Search instead forĀ 
Did you mean:Ā 

Lakeflow SDP (DLT) produce external tables, or only UC-managed

nidhin
New Contributor III

As I understand it, streaming tables and materialized views produced by Lakeflow Spark Declarative Pipelines (DLT) are always Unity Catalog managed tables , there's no LOCATION/path option on create_streaming_table or apply_changes.

Is that correct? A few questions:

Is there any supported way to make an SDP output a true external table?


If not, what are the recommended design patterns when you need the data as external — e.g. a designated

managed storage location at the catalog/schema level, a pipeline sink writing to an external path, or a downstream job that re-writes to CREATE TABLE ... LOCATION?


Any reason to prefer one pattern over another for a Gold SCD2 dimension?


Thanks!

2 REPLIES 2

balajij8
Contributor III

Hi, 

Streaming Tables & Materialized Views supported in SDP are a form of Managed Tables. More details here

create_streaming_table() has a path parameter but specifying it creates a managed table with a custom storage location (not an external table). It uses the managed storage location for the schema containing the table if its not set in the code.

SDP is designed exclusively for managed table output. You need a post-processing step to get external tables or any of below.

  • Downstream Job with CREATE EXTERNAL TABLE - You can create an external table on top of the gold table at the designated external location
  • Schema Level Managed Storage - You can use pre configured schema level paths if the goal is predictable storage location. Its still managed tables you can set at catalog/schema level but stored at designated location
  • SDP to Staging with External Final Write - You can use SDP to create managed staging tables and use a separate job with create external tables to read from staging with appropriate changes

Gold SCD 2 Dimension

You can stay with managed tables unless you have specific requirements as its the best option. It gives automatic optimization, faster queries & uc managed maintenance.
You can use external tables only if you have legacy systems requiring direct file access from cloud storage paths or data shared with non-UC systems. You can use Downstream Job approach if external tables are required.

Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @nidhin,

What you’re saying is basically correct for a Unity Catalog-enabled Lakeflow Spark Declarative Pipelines setup. In that model, pipelines publish streaming tables and materialized views into the target catalog and schema, the data is stored using the managed storage location for the containing schema or catalog, and the docs explicitly say that the LOCATION property is not supported when defining the table. That means there isn’t a supported way to make the pipeline-managed output itself a true Unity Catalog external table with its own LOCATION clause. See Use Unity Catalog with pipelines.

So if the requirement is really "I want control over where the managed data lives," the recommended pattern is to set a managed storage location at the schema or catalog level and let the pipeline write there. That keeps you on the happy path for SDP while still giving you predictable storage placement. The same Unity Catalog pipelines doc above is the key reference for that behaviour.

If the requirement is "I need external systems to consume this dataset," I would usually not jump straight to rewriting the table as an external table. Databricks now recommends either external data access for streaming tables and materialized views for modern clients, or Compatibility Mode for older or path-oriented clients. External data access is the nicer option when the client supports the REST APIs because it avoids a full data copy and gives read-after-write consistency, whereas Compatibility Mode creates a read-only copy at a chosen location and is better when broad client compatibility matters more than immediacy or storage efficiency. The high-level guidance is also summarised in Access Databricks data using external systems.

If you truly need an actual external Delta table as the contract, I'd frame that as a downstream serving pattern rather than as the SDP-owned output. In other words, let SDP own the authoritative managed table, then populate a separate external table from it with a downstream job if you have a hard requirement for CREATE TABLE ... LOCATION. That keeps the pipeline aligned with the documented UC model and isolates the "external table" decision to a serving layer instead of your core transformation layer.

For a Gold SCD2 dimension, my default preference would be to keep the authoritative Gold table as the native UC-managed SDP output and only add an external-facing representation if there is a concrete downstream need. If users just need to read it externally, external data access is usually the best first choice, with Compatibility Mode as the fallback for older clients. I'd only choose a downstream rewrite into a separate external table when you specifically need cloud-path-based ownership or an external-table contract that other platforms depend on. This will give you the best balance of correctness, governance, and operational simplicity for an SCD2 dimension.

Hope this helps.

If this answer resolves your question, could you mark it as ā€œAccept as Solutionā€? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***