Problems and questions with deploying Lakeflow Declarative Pipeline using Databricks Bundles

AlbertWang
Valued Contributor

 

Hi all,

I ran into some problems and have some questions about deploying a Lakeflow Declarative Pipeline using Databricks Asset Bundles. Could anyone kindly help?

Below is my current bundle resource file for the pipeline:

 

resources:
  pipelines:
    dbr_d365_crm_pipeline:
      name: dbr_d365_crm_pipeline
      libraries:
        - file:
            path: ../src/pipeline/transformations/**
      clusters:
        - label: default
          aws_attributes: {}
          node_type_id: Standard_D4ads_v5
          driver_node_type_id: Standard_D4ads_v5
          num_workers: 0
      configuration:
        env: ${bundle.target}
        tables_config: ${var.tables_config}
      catalog: ag_dbr_ctlg_silver_${bundle.environment}
      schema: d365_crm
      continuous: false
      photon: false
      development: ${var.is_dev}
      edition: ADVANCED
      channel: CURRENT
      serverless: false

 

1. The `schema` field does not work. The documentation says:

AlbertWang_0-1754014007933.png

However, when I run `databricks bundle deploy`, I get the warning "unknown field: schema" and the error "The target schema field is required for UC pipelines". After changing `schema` to `target`, the deployment works.
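For reference, this is the variant that deploys successfully for me (same catalog/schema values as in my resource file above):

```yaml
resources:
  pipelines:
    dbr_d365_crm_pipeline:
      catalog: ag_dbr_ctlg_silver_${bundle.environment}
      # `schema: d365_crm` is rejected with "unknown field: schema";
      # `target` is accepted by my CLI version:
      target: d365_crm
```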

2. Even though I set `num_workers` to `0`, the deployed pipeline's cluster mode is still "Enhanced autoscaling", defaulting to 1–5 workers. I don't know how to configure the pipeline's cluster mode to "Fixed size" with 0 workers using Bundles.

AlbertWang_1-1754014301014.png
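My best guess, borrowed from how single-node interactive clusters are declared (the `spark_conf` and `custom_tags` values below), is the following; I have not verified that pipeline clusters accept these settings, so please correct me if this is wrong:

```yaml
      clusters:
        - label: default
          node_type_id: Standard_D4ads_v5
          driver_node_type_id: Standard_D4ads_v5
          num_workers: 0
          # guess: single-node profile settings as used for regular clusters
          spark_conf:
            spark.master: "local[*]"
            spark.databricks.cluster.profile: singleNode
          custom_tags:
            ResourceClass: SingleNode
```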

3. When I create the pipeline manually in the UI, I can set the pipeline root folder, but I cannot find a way to do this when deploying with Bundles.
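For what it's worth, the Pipelines API appears to have a `root_path` setting; I don't know whether the bundle schema accepts it, so this is only a guess:

```yaml
resources:
  pipelines:
    dbr_d365_crm_pipeline:
      # guess: may or may not be supported by the bundle schema
      root_path: /Workspace/Users/xxx/xxx/xxx
```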

4. When I create the pipeline manually in the UI, I can set the pipeline's source code to a folder, and the corresponding YAML shows:

 

libraries:
  - glob:
      include: /Workspace/Users/xxx/xxx/xxx/transformations/**

 

However, I cannot use `glob` in the Bundles pipeline resource file; I can only use `file`, as shown in the code above.
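For completeness, this is the form I tried in the bundle resource file, mirroring the UI-generated YAML; my CLI version rejects it (perhaps newer CLI releases accept `glob` here):

```yaml
      libraries:
        - glob:
            include: ../src/pipeline/transformations/**
```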

5. When I create a pipeline manually in the UI, the pipeline accepts the following code:

 

import dlt
from pyspark.sql.functions import col

def create_scd2_table(view_name, scd2_table_name, keys, sequence_by):
    dlt.create_streaming_table(f"{catalog_silver}.{schema}.{scd2_table_name}")
    dlt.create_auto_cdc_flow(
        target=f"{catalog_silver}.{schema}.{scd2_table_name}",
        source=view_name,
        keys=keys,
        sequence_by=col(sequence_by),
        stored_as_scd_type=2,
    )

 

And

 

from pyspark.sql.functions import col, lit, when

def create_materialized_view(scd2_table_name, scd2_materialized_view_name):
    @dlt.table(name=f"{catalog_gold}.{schema}.{scd2_materialized_view_name}")
    def mv():
        return (
            dlt.read(f"{catalog_silver}.{schema}.{scd2_table_name}")
            .withColumn("is_current", col("__END_AT").isNull())
            .withColumn(
                "__END_AT",
                when(col("__END_AT").isNull(), lit(MAX_END_AT)).otherwise(col("__END_AT")),
            )
        )

 

That means I can control where the streaming tables and materialized views are created (which UC catalog and schema). However, the pipeline deployed via Bundles does not seem to support this: I cannot set the catalog and schema per table, and everything must be created under the pipeline's catalog and schema.

Can anyone help?

Thank you.

Regards,

Albert