07-31-2025 09:29 PM
Hi all,
I met some problems and have some questions about deploying Lakeflow Declarative Pipeline using Databricks Bundles. Could anyone kindly help?
Below is my current bundle resource file for the pipeline:
resources:
  pipelines:
    dbr_d365_crm_pipeline:
      name: dbr_d365_crm_pipeline
      libraries:
        - file:
            path: ../src/pipeline/transformations/**
      clusters:
        - label: default
          aws_attributes: {}
          node_type_id: Standard_D4ads_v5
          driver_node_type_id: Standard_D4ads_v5
          num_workers: 0
      configuration:
        env: ${bundle.target}
        tables_config: ${var.tables_config}
      catalog: ag_dbr_ctlg_silver_${bundle.environment}
      schema: d365_crm
      continuous: false
      photon: false
      development: ${var.is_dev}
      edition: ADVANCED
      channel: CURRENT
      serverless: false
1. The field `schema` does not work, even though the documentation shows it.
However, when I run `databricks bundle deploy`, I get the warning "unknown field: schema" and the error "The target schema field is required for UC pipelines". After changing `schema` to `target`, the deployment works.
2. Even though I set `num_workers` to `0`, the deployed pipeline's cluster mode is still set to "Enhanced autoscaling", defaulting to 1 to 5 workers. I don't know how to configure the pipeline's cluster mode to "Fixed size" with 0 workers using Bundles.
3. When I create the pipeline manually in the UI, I can set the pipeline's root folder, but I cannot find a way to do this when deploying with Bundles.
4. When I create the pipeline manually in the UI, I can set the pipeline's source code to a folder, and the corresponding YAML shows:
libraries:
  - glob:
      include: /Workspace/Users/xxx/xxx/xxx/transformations/**
However, I cannot use `glob` in the Bundles pipeline resource file; I can only use `file`, as shown in the code above.
5. When I create a pipeline manually in the UI, the pipeline accepts the following code:
def create_scd2_table(view_name, scd2_table_name, keys, sequence_by):
    dlt.create_streaming_table(f"{catalog_silver}.{schema}.{scd2_table_name}")
    dlt.create_auto_cdc_flow(
        target=f"{catalog_silver}.{schema}.{scd2_table_name}",
        source=view_name,
        keys=keys,
        sequence_by=col(sequence_by),
        stored_as_scd_type=2,
    )
And
def create_materialized_view(scd2_table_name, scd2_materialized_view_name):
    @dlt.table(name=f"{catalog_gold}.{schema}.{scd2_materialized_view_name}")
    def mv():
        return dlt.read(f"{catalog_silver}.{schema}.{scd2_table_name}") \
            .withColumn("is_current", col("__END_AT").isNull()) \
            .withColumn("__END_AT",
                when(
                    col("__END_AT").isNull(),
                    lit(MAX_END_AT)
                ).otherwise(col("__END_AT"))
            )
In other words, I can control which UC catalog and schema the streaming tables and materialized views are created in. However, the pipeline deployed via Bundles does not seem to support this: I cannot override the catalog and schema of individual streaming tables and materialized views; they are always created under the pipeline's catalog and schema.
Can anyone help?
Thank you.
Regards,
Albert
08-01-2025 01:33 AM - edited 08-01-2025 01:35 AM
Hi @AlbertWang ,
I think some of those issues could be related to your Databricks Asset Bundles version. For example, the `glob` setting is in Beta; it may be available in the UI but not yet in your version of the Databricks CLI.
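For reference, here is a minimal sketch of what the glob form could look like in a bundle resource file once your CLI supports it, assuming it maps to the same glob/include fields the UI YAML shows (the bundle-relative path is just an illustration):

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      libraries:
        # Beta: pick up every transformation file under the folder
        - glob:
            include: ../src/pipeline/transformations/**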
The same applies to `root_path`.
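Again only as a hedged sketch, on a newer CLI the root folder should be settable on the pipeline resource itself; the path below is just an example and may need to be adjusted to your layout:

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      # Root folder for the pipeline's source code; availability depends on CLI version
      root_path: ../src/pipeline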
As for the `num_workers` issue: a multi-node compute resource can't be scaled to 0 workers. Use single-node compute instead (Compute configuration reference | Databricks Documentation).
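A commonly suggested sketch for a fixed-size, single-node pipeline cluster is below. It assumes the pipeline cluster spec honors the same single-node settings as the Clusters API (the singleNode Spark profile plus the SingleNode tag); for a fixed-size multi-node cluster, setting num_workers greater than 0 and omitting any autoscale block should be enough:

clusters:
  - label: default
    node_type_id: Standard_D4ads_v5
    driver_node_type_id: Standard_D4ads_v5
    # Fixed size with 0 workers: single-node settings instead of autoscaling
    num_workers: 0
    spark_conf:
      spark.databricks.cluster.profile: singleNode
      spark.master: "local[*]"
    custom_tags:
      ResourceClass: SingleNode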
Maybe you have an outdated Databricks CLI version? That would also explain the "unknown field: schema" error; to an outdated CLI, this field would be unknown.
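With a recent CLI, the catalog/schema part of your resource should deploy as you originally wrote it; only the field name differs from the older target form. A sketch:

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      catalog: ag_dbr_ctlg_silver_${bundle.environment}
      # Recognized by newer CLI versions; older versions only accept `target`
      schema: d365_crm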
08-01-2025 02:47 AM
Thank you for your reply, szymon_dybczak.
After upgrading my Databricks CLI, I could configure `schema`, `glob`, and `root_path`. I also figured out how to configure a single-node cluster.
However, I still cannot figure out the following problem.
5. When I create a pipeline manually in the UI, the pipeline accepts the following code:
def create_scd2_table(view_name, scd2_table_name, keys, sequence_by):
    dlt.create_streaming_table(f"{catalog_silver}.{schema}.{scd2_table_name}")
    dlt.create_auto_cdc_flow(
        target=f"{catalog_silver}.{schema}.{scd2_table_name}",
        source=view_name,
        keys=keys,
        sequence_by=col(sequence_by),
        stored_as_scd_type=2,
    )
And
def create_materialized_view(scd2_table_name, scd2_materialized_view_name):
    @dlt.table(name=f"{catalog_gold}.{schema}.{scd2_materialized_view_name}")
    def mv():
        return dlt.read(f"{catalog_silver}.{schema}.{scd2_table_name}") \
            .withColumn("is_current", col("__END_AT").isNull()) \
            .withColumn("__END_AT",
                when(
                    col("__END_AT").isNull(),
                    lit(MAX_END_AT)
                ).otherwise(col("__END_AT"))
            )
In other words, I can control which UC catalog and schema the streaming tables and materialized views are created in. However, the pipeline deployed via Bundles does not seem to support this: I cannot override the catalog and schema of individual streaming tables and materialized views; they are always created under the pipeline's catalog and schema.
08-01-2025 03:10 AM - edited 08-01-2025 03:14 AM
Hi @AlbertWang ,
Cool that most of the issues have been resolved by upgrading DAB to a newer version. Regarding the last issue, it's a bit weird; it should work. Check if you have everything configured according to the article below:
Publish to Multiple Catalogs and Schemas from a Single DLT Pipeline | Databricks Blog
So, make sure that you're using `schema` (not `target`) in your pipeline. Also, in the thread below, one user suggested checking whether the DPM setting is enabled in your pipeline. It's worth checking.
Solved: Delta Live Tables: dynamic schema - Databricks Community - 57626
In your pipeline settings, you should have `pipelines.enableDPMForExistingPipeline` set to true, as sketched after the link below.
Enable the default publishing mode in a pipeline - Azure Databricks | Microsoft Learn
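If the pipeline predates the default publishing mode, one way to pass that setting is through the bundle's configuration block, assuming it is forwarded to the pipeline settings like any other configuration key:

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      configuration:
        pipelines.enableDPMForExistingPipeline: "true"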
08-01-2025 03:56 AM
I really appreciate your kind help, szymon_dybczak!
After using `schema`, everything works now.
08-01-2025 03:58 AM
Great, really happy that it worked for you. Thanks for accepting the answer as a solution!