cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Automate Lakeflow connect to ingest 300 tables not manually

muaaz
New Contributor III

I have data in PostgreSQL and I’m using Lakeflow Connect via UI to ingest it into Databricks streaming tables.

Currently, each Lakeflow Connect pipeline only allows connecting one PostgreSQL table. I have around 300 tables, and creating pipelines manually for each table is time-consuming.

I’m looking for a way to automate this process, where I can provide a PostgreSQL connection and table names (or a list/schema), and automatically generate and deploy the required Lakeflow Connect pipelines.

I explored Asset Bundles and YAML-based definitions, but it seems Lakeflow Connect resources are not fully supported there yet.

What would be a scalable or recommended approach to design this setup in Databricks?

2 ACCEPTED SOLUTIONS

Accepted Solutions

szymon_dybczak
Esteemed Contributor III

Hi @muaaz ,

Still you can achieve that. You can use Databricks Automation Bundles (DABs) to implement dynamic behaviour:

resources:
  pipelines:
    gateway:
      name: <gateway-name>
      gateway_definition:
        connection_id: <connection-id>
        gateway_storage_catalog: <destination-catalog>
        gateway_storage_schema: <destination-schema>
        gateway_storage_name: <destination-schema>
      target: <destination-schema>
      catalog: <destination-catalog>

    pipeline_sqlserver:
      name: <pipeline-name>
      catalog: <target-catalog-1> # Location of the pipeline event log
      schema: <target-schema-1> # Location of the pipeline event log
      ingestion_definition:
        connection_name: <connection-name>
        objects:
          - table:
              source_schema: <source-schema-1>
              source_table: <source-table-1>
              destination_catalog: <target-catalog-1> # Location of this table
              destination_schema: <target-schema-1> # Location of this table
          - table:
              source_schema: <source-schema-2>
              source_table: <source-table-2>
              destination_catalog: <target-catalog-2> # Location of this table
              destination_schema: <target-schema-2> # Location of this table

Check following documentation for details:

Create multi-destination pipelines | Databricks on AWS

If my answer was helpful, please consider marking it as accepted solution.

View solution in original post

balajij8
Contributor III

You can seamlessly execute the things done via UI in the DABs. You can configure your multi table Lake flow pipelines using YAML configuration if you prefer configuration to ensure reproducibility. More details for Post gre sql ingestion here

You can manage Lakeflow Connect pipelines as code using Asset Bundles for sql server by adding few files like below and use similar approach for other databases

  • Workflow file that controls the frequency of data ingestion (sqlserver.yml).
variables:
  # Common variables used multiple places in the DAB definition.
  gateway_name:
    default: sqlserver01-gateway-pipeline
  dest_catalog:
    default: main
  dest_schema:
    default: sqlserver01

resources:
  pipelines:
    gateway:
      name: ${var.gateway_name}
      gateway_definition:
        connection_name: rebel
        gateway_storage_catalog: main
        gateway_storage_schema: sqlserver01
        gateway_storage_name: sqlserver01-gateway-pipeline
      catalog: main
      target: sqlserver01

    pipeline_sqlserver:
      name: sqlserver-ingestion-pipeline
      ingestion_definition:
        ingestion_gateway_id: ${resources.pipelines.gateway.id}
        objects:
          - schema:
              # Ingest all tables in the sqlserver01.dbo schema to main.dest_schema. The destination table name will be drivers, the same as it is on the source.
              source_catalog: sqlserver01
              source_schema: dbo
              destination_catalog: main
              destination_schema: sqlserver01
      target: sqlserver01
      catalog: main
  • Pipeline Job definition file (sqlserver_pipeline.yml).
resources:
  jobs:
    sqlserver_dab_job:
      name: sqlserver-ingestion-pipeline job

      trigger:
        periodic:
          interval: 8
          unit: HOURS

      email_notifications:
        on_failure:
          - user email

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.pipeline_sqlserver.id}

 

View solution in original post

6 REPLIES 6

balajij8
Contributor III

Configuring Databricks Lake flow Connect for PostgreSQL is a streamlined, multi-step seamless process and you can ingest multiple tables within a single pipeline.

You can follow below

Selecting Multiple Tables via the UI

In the pipeline creation wizard where you will select your tables in the Source step ("Specify what data to ingest" - 3rd step).

  • You can check the boxes for all the tables you want to include.

  • For each selected table, you can individually configure specific settings such as Primary Keys and History Tracking (SCD behavior). Ensure the post gres schema & tables are configured before creating a pipeline in Lakeflow Connect

Scalability & Limits

  • Table Limits: Databricks recommends configuring 250 or fewer tables per pipeline to ensure optimal performance and manageability. If you need to ingest more than 250 tables, you can split them across multiple pipelines grouping by domain or schema. More details here

  • Data Volume: There is no limit on the number of rows or columns supported within these tables.

YAML

You can configure your multi table Lake flow pipelines using YAML configuration if you prefer configuration to ensure reproducibility. More details here

muaaz
New Contributor III

Thanks @balajij8 for your reply.

My goal is to find a solution that works across all Lakeflow Connect connectors. For the proof of concept, I am using PostgreSQL, but the approach should ideally be connector agnostic.

We would prefer not to use the UI because we don't want to manually configure hundreds of tables and pipelines. Instead, we are looking for an Infrastructure-as-Code or YAML-based approach where connector configurations and Lakeflow Connect pipeline definitions can be managed declaratively.

Is there a way to define the connector configuration and Lakeflow Connect pipeline once, and then dynamically onboard multiple tables without manually creating through the UI?

Any guidance, examples, or recommended patterns would be greatly appreciated.

szymon_dybczak
Esteemed Contributor III

Hi @muaaz ,

Still you can achieve that. You can use Databricks Automation Bundles (DABs) to implement dynamic behaviour:

resources:
  pipelines:
    gateway:
      name: <gateway-name>
      gateway_definition:
        connection_id: <connection-id>
        gateway_storage_catalog: <destination-catalog>
        gateway_storage_schema: <destination-schema>
        gateway_storage_name: <destination-schema>
      target: <destination-schema>
      catalog: <destination-catalog>

    pipeline_sqlserver:
      name: <pipeline-name>
      catalog: <target-catalog-1> # Location of the pipeline event log
      schema: <target-schema-1> # Location of the pipeline event log
      ingestion_definition:
        connection_name: <connection-name>
        objects:
          - table:
              source_schema: <source-schema-1>
              source_table: <source-table-1>
              destination_catalog: <target-catalog-1> # Location of this table
              destination_schema: <target-schema-1> # Location of this table
          - table:
              source_schema: <source-schema-2>
              source_table: <source-table-2>
              destination_catalog: <target-catalog-2> # Location of this table
              destination_schema: <target-schema-2> # Location of this table

Check following documentation for details:

Create multi-destination pipelines | Databricks on AWS

If my answer was helpful, please consider marking it as accepted solution.

balajij8
Contributor III

You can seamlessly execute the things done via UI in the DABs. You can configure your multi table Lake flow pipelines using YAML configuration if you prefer configuration to ensure reproducibility. More details for Post gre sql ingestion here

You can manage Lakeflow Connect pipelines as code using Asset Bundles for sql server by adding few files like below and use similar approach for other databases

  • Workflow file that controls the frequency of data ingestion (sqlserver.yml).
variables:
  # Common variables used multiple places in the DAB definition.
  gateway_name:
    default: sqlserver01-gateway-pipeline
  dest_catalog:
    default: main
  dest_schema:
    default: sqlserver01

resources:
  pipelines:
    gateway:
      name: ${var.gateway_name}
      gateway_definition:
        connection_name: rebel
        gateway_storage_catalog: main
        gateway_storage_schema: sqlserver01
        gateway_storage_name: sqlserver01-gateway-pipeline
      catalog: main
      target: sqlserver01

    pipeline_sqlserver:
      name: sqlserver-ingestion-pipeline
      ingestion_definition:
        ingestion_gateway_id: ${resources.pipelines.gateway.id}
        objects:
          - schema:
              # Ingest all tables in the sqlserver01.dbo schema to main.dest_schema. The destination table name will be drivers, the same as it is on the source.
              source_catalog: sqlserver01
              source_schema: dbo
              destination_catalog: main
              destination_schema: sqlserver01
      target: sqlserver01
      catalog: main
  • Pipeline Job definition file (sqlserver_pipeline.yml).
resources:
  jobs:
    sqlserver_dab_job:
      name: sqlserver-ingestion-pipeline job

      trigger:
        periodic:
          interval: 8
          unit: HOURS

      email_notifications:
        on_failure:
          - user email

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.pipeline_sqlserver.id}

 

muaaz
New Contributor III

Thanks @balajij8 for your support. This YAML approach is working for me.

muaaz
New Contributor III

Thanks @szymon_dybczak for your support.