
Problems and questions with deploying Lakeflow Declarative Pipeline using Databricks Bundles

AlbertWang
Valued Contributor

 

Hi all,

I ran into some problems and have some questions about deploying a Lakeflow Declarative Pipeline using Databricks Bundles. Could anyone kindly help?

Below is my current bundle resource file for the pipeline:

 

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      name: dbr_d365_crm_pipeline
      libraries:
        - file:
            path: ../src/pipeline/transformations/**
      clusters:
        - label: default
          aws_attributes: {}
          node_type_id: Standard_D4ads_v5
          driver_node_type_id: Standard_D4ads_v5
          num_workers: 0
      configuration:
        env: ${bundle.target}
        tables_config: ${var.tables_config}
      catalog: ag_dbr_ctlg_silver_${bundle.environment}
      schema: d365_crm
      continuous: false
      photon: false
      development: ${var.is_dev}
      edition: ADVANCED
      channel: CURRENT
      serverless: false

 

1. The field `schema` does not work. The documentation says:

AlbertWang_0-1754014007933.png

However, when I run `databricks bundle deploy`, I get the warning "unknown field: schema" and the error "The target schema field is required for UC pipelines". After changing `schema` to `target`, the deployment works.
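For reference, the variant that deployed successfully looks roughly like this (a sketch showing only the relevant fields):

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      catalog: ag_dbr_ctlg_silver_${bundle.environment}
      # `schema: d365_crm` triggers "unknown field: schema" on this CLI version,
      # while `target` is accepted and the deploy succeeds
      target: d365_crm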

2. Even though I set `num_workers` to `0`, the deployed pipeline's Cluster mode is still set to "Enhanced autoscaling" and defaults to 1-5 workers. I don't know how to configure the pipeline's Cluster mode to "Fixed size" with 0 workers using Bundles.

AlbertWang_1-1754014301014.png

3. When I create the pipeline manually in the UI, I can set the pipeline root folder, but I cannot find a way to do this when deploying with Bundles.

4. When I create the pipeline manually in the UI, I can set the pipeline's Source code to a folder, and the corresponding YAML shows:

 

libraries:
  - glob:
      include: /Workspace/Users/xxx/xxx/xxx/transformations/**

 

However, I cannot use `glob` in the Bundles pipeline resource file; I can only use `file`, as shown in the code above.

5. When I create a pipeline manually in the UI, the pipeline accepts the following code:

 

import dlt
from pyspark.sql.functions import col

# catalog_silver and schema are defined elsewhere in the pipeline source
def create_scd2_table(view_name, scd2_table_name, keys, sequence_by):
    # Create the SCD Type 2 streaming table and feed it with an auto CDC flow
    dlt.create_streaming_table(f"{catalog_silver}.{schema}.{scd2_table_name}")
    dlt.create_auto_cdc_flow(
        target=f"{catalog_silver}.{schema}.{scd2_table_name}",
        source=view_name,
        keys=keys,
        sequence_by=col(sequence_by),
        stored_as_scd_type=2,
    )

 

And

 

from pyspark.sql.functions import col, lit, when

# catalog_gold, catalog_silver, schema, and MAX_END_AT are defined elsewhere
def create_materialized_view(scd2_table_name, scd2_materialized_view_name):
    @dlt.table(name=f"{catalog_gold}.{schema}.{scd2_materialized_view_name}")
    def mv():
        # Flag the current record and replace the open-ended __END_AT with MAX_END_AT
        return (
            dlt.read(f"{catalog_silver}.{schema}.{scd2_table_name}")
            .withColumn("is_current", col("__END_AT").isNull())
            .withColumn(
                "__END_AT",
                when(col("__END_AT").isNull(), lit(MAX_END_AT)).otherwise(col("__END_AT")),
            )
        )

 

That means I can customize which UC catalog and schema the streaming tables and materialized views are created in. However, the pipeline deployed via Bundles does not support this: I cannot define the catalog and schema of the streaming tables and materialized views, and they must be created under the pipeline's catalog and schema.

Can anyone help?

Thank you.

Regards,

Albert


5 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @AlbertWang ,

I think some of these issues could be related to your Databricks Asset Bundles version. For example, the `glob` option is in Beta; it could be available in the UI but not yet in your version of the Databricks CLI.

szymon_dybczak_0-1754036843880.png

The same applies to `root_path`:

szymon_dybczak_1-1754037087382.png
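For example, with a recent CLI the pipeline resource should be able to reference a source folder and a root path roughly like this (a sketch based on the docs above; the relative paths are assumptions and need to match your bundle layout):

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      # Both fields require a recent Databricks CLI version (glob is in Beta)
      root_path: ../src/pipeline
      libraries:
        - glob:
            include: ../src/pipeline/transformations/**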

 

As for the `num_workers` issue: a multi-node compute resource can't be scaled to 0 workers. Use single-node compute instead (Compute configuration reference | Databricks Documentation).
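A minimal sketch of a fixed-size, zero-worker (single node) pipeline cluster, assuming pipeline clusters accept the same single-node spark_conf and tags documented for regular compute (worth verifying against the compute configuration reference above):

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      clusters:
        - label: default
          node_type_id: Standard_D4ads_v5
          driver_node_type_id: Standard_D4ads_v5
          num_workers: 0
          spark_conf:
            # Documented single-node settings for Databricks compute;
            # assumption: they apply to pipeline clusters as well
            spark.databricks.cluster.profile: singleNode
            spark.master: local[*]
          custom_tags:
            ResourceClass: SingleNode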


Maybe you have an outdated Databricks CLI version? That would also explain the "unknown field: schema" error: to an outdated CLI, this field would be unknown.

AlbertWang
Valued Contributor

Thank you for your reply, szymon_dybczak.

After upgrading my Databricks CLI, I could configure `schema`, `glob`, and `root_path`. I also figured out how to configure a single-node cluster.

However, I still cannot figure out the reason for the following problem.

 

5. When I create a pipeline manually in the UI, the pipeline accepts the following code:

 

import dlt
from pyspark.sql.functions import col

# catalog_silver and schema are defined elsewhere in the pipeline source
def create_scd2_table(view_name, scd2_table_name, keys, sequence_by):
    # Create the SCD Type 2 streaming table and feed it with an auto CDC flow
    dlt.create_streaming_table(f"{catalog_silver}.{schema}.{scd2_table_name}")
    dlt.create_auto_cdc_flow(
        target=f"{catalog_silver}.{schema}.{scd2_table_name}",
        source=view_name,
        keys=keys,
        sequence_by=col(sequence_by),
        stored_as_scd_type=2,
    )

 

And

 

from pyspark.sql.functions import col, lit, when

# catalog_gold, catalog_silver, schema, and MAX_END_AT are defined elsewhere
def create_materialized_view(scd2_table_name, scd2_materialized_view_name):
    @dlt.table(name=f"{catalog_gold}.{schema}.{scd2_materialized_view_name}")
    def mv():
        # Flag the current record and replace the open-ended __END_AT with MAX_END_AT
        return (
            dlt.read(f"{catalog_silver}.{schema}.{scd2_table_name}")
            .withColumn("is_current", col("__END_AT").isNull())
            .withColumn(
                "__END_AT",
                when(col("__END_AT").isNull(), lit(MAX_END_AT)).otherwise(col("__END_AT")),
            )
        )

 

That means I can customize which UC catalog and schema the streaming tables and materialized views are created in. However, the pipeline deployed via Bundles does not support this: I cannot define the catalog and schema of the streaming tables and materialized views, and they must be created under the pipeline's catalog and schema.

szymon_dybczak
Esteemed Contributor III
(ACCEPTED SOLUTION)

Hi @AlbertWang ,

Cool that most of the issues have been resolved by upgrading DAB to a newer version. Regarding the last error, it's a bit weird; it should work. Check whether you have everything configured according to the article below:

szymon_dybczak_0-1754042864830.png

Publish to Multiple Catalogs and Schemas from a Single DLT Pipeline | Databricks Blog

So, make sure that you're using `schema` (not `target`) in your pipeline. Also, in the thread below, one user suggested checking whether the DPM setting is enabled in your pipeline. It's worth checking.


Solved: Delta Live Tables: dynamic schema - Databricks Community - 57626

szymon_dybczak_1-1754042999630.png

 



In your pipeline settings, you should have `pipelines.enableDPMForExistingPipeline` set to `true`.

Enable the default publishing mode in a pipeline - Azure Databricks | Microsoft Learn
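In a bundle resource file, that setting can presumably be passed through the pipeline's configuration block together with `schema`, roughly like this (a sketch showing only the relevant fields):

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      catalog: ag_dbr_ctlg_silver_${bundle.environment}
      schema: d365_crm   # use schema, not target
      configuration:
        pipelines.enableDPMForExistingPipeline: "true"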

AlbertWang
Valued Contributor

I really appreciate your kind help, szymon_dybczak!

After using `schema`, everything works now.

szymon_dybczak
Esteemed Contributor III

Great, really happy that it worked for you. Thanks for accepting the answer as a solution!