yesterday
I have data in PostgreSQL and Iโm using Lakeflow Connect via UI to ingest it into Databricks streaming tables.
Currently, each Lakeflow Connect pipeline only allows connecting one PostgreSQL table. I have around 300 tables, and creating pipelines manually for each table is time-consuming.
Iโm looking for a way to automate this process, where I can provide a PostgreSQL connection and table names (or a list/schema), and automatically generate and deploy the required Lakeflow Connect pipelines.
I explored Asset Bundles and YAML-based definitions, but it seems Lakeflow Connect resources are not fully supported there yet.
What would be a scalable or recommended approach to design this setup in Databricks?
14 hours ago - last edited 14 hours ago
Hi @muaaz ,
Still you can achieve that. You can use Databricks Automation Bundles (DABs) to implement dynamic behaviour:
resources:
pipelines:
gateway:
name: <gateway-name>
gateway_definition:
connection_id: <connection-id>
gateway_storage_catalog: <destination-catalog>
gateway_storage_schema: <destination-schema>
gateway_storage_name: <destination-schema>
target: <destination-schema>
catalog: <destination-catalog>
pipeline_sqlserver:
name: <pipeline-name>
catalog: <target-catalog-1> # Location of the pipeline event log
schema: <target-schema-1> # Location of the pipeline event log
ingestion_definition:
connection_name: <connection-name>
objects:
- table:
source_schema: <source-schema-1>
source_table: <source-table-1>
destination_catalog: <target-catalog-1> # Location of this table
destination_schema: <target-schema-1> # Location of this table
- table:
source_schema: <source-schema-2>
source_table: <source-table-2>
destination_catalog: <target-catalog-2> # Location of this table
destination_schema: <target-schema-2> # Location of this tableCheck following documentation for details:
Create multi-destination pipelines | Databricks on AWS
If my answer was helpful, please consider marking it as accepted solution.
14 hours ago
You can seamlessly execute the things done via UI in the DABs. You can configure your multi table Lake flow pipelines using YAML configuration if you prefer configuration to ensure reproducibility. More details for Post gre sql ingestion here
You can manage Lakeflow Connect pipelines as code using Asset Bundles for sql server by adding few files like below and use similar approach for other databases
variables:
# Common variables used multiple places in the DAB definition.
gateway_name:
default: sqlserver01-gateway-pipeline
dest_catalog:
default: main
dest_schema:
default: sqlserver01
resources:
pipelines:
gateway:
name: ${var.gateway_name}
gateway_definition:
connection_name: rebel
gateway_storage_catalog: main
gateway_storage_schema: sqlserver01
gateway_storage_name: sqlserver01-gateway-pipeline
catalog: main
target: sqlserver01
pipeline_sqlserver:
name: sqlserver-ingestion-pipeline
ingestion_definition:
ingestion_gateway_id: ${resources.pipelines.gateway.id}
objects:
- schema:
# Ingest all tables in the sqlserver01.dbo schema to main.dest_schema. The destination table name will be drivers, the same as it is on the source.
source_catalog: sqlserver01
source_schema: dbo
destination_catalog: main
destination_schema: sqlserver01
target: sqlserver01
catalog: mainresources:
jobs:
sqlserver_dab_job:
name: sqlserver-ingestion-pipeline job
trigger:
periodic:
interval: 8
unit: HOURS
email_notifications:
on_failure:
- user email
tasks:
- task_key: refresh_pipeline
pipeline_task:
pipeline_id: ${resources.pipelines.pipeline_sqlserver.id}
yesterday
Configuring Databricks Lake flow Connect for PostgreSQL is a streamlined, multi-step seamless process and you can ingest multiple tables within a single pipeline.
You can follow below
In the pipeline creation wizard where you will select your tables in the Source step ("Specify what data to ingest" - 3rd step).
You can check the boxes for all the tables you want to include.
For each selected table, you can individually configure specific settings such as Primary Keys and History Tracking (SCD behavior). Ensure the post gres schema & tables are configured before creating a pipeline in Lakeflow Connect
Table Limits: Databricks recommends configuring 250 or fewer tables per pipeline to ensure optimal performance and manageability. If you need to ingest more than 250 tables, you can split them across multiple pipelines grouping by domain or schema. More details here
Data Volume: There is no limit on the number of rows or columns supported within these tables.
You can configure your multi table Lake flow pipelines using YAML configuration if you prefer configuration to ensure reproducibility. More details here
15 hours ago
Thanks @balajij8 for your reply.
My goal is to find a solution that works across all Lakeflow Connect connectors. For the proof of concept, I am using PostgreSQL, but the approach should ideally be connector agnostic.
We would prefer not to use the UI because we don't want to manually configure hundreds of tables and pipelines. Instead, we are looking for an Infrastructure-as-Code or YAML-based approach where connector configurations and Lakeflow Connect pipeline definitions can be managed declaratively.
Is there a way to define the connector configuration and Lakeflow Connect pipeline once, and then dynamically onboard multiple tables without manually creating through the UI?
Any guidance, examples, or recommended patterns would be greatly appreciated.
14 hours ago - last edited 14 hours ago
Hi @muaaz ,
Still you can achieve that. You can use Databricks Automation Bundles (DABs) to implement dynamic behaviour:
resources:
pipelines:
gateway:
name: <gateway-name>
gateway_definition:
connection_id: <connection-id>
gateway_storage_catalog: <destination-catalog>
gateway_storage_schema: <destination-schema>
gateway_storage_name: <destination-schema>
target: <destination-schema>
catalog: <destination-catalog>
pipeline_sqlserver:
name: <pipeline-name>
catalog: <target-catalog-1> # Location of the pipeline event log
schema: <target-schema-1> # Location of the pipeline event log
ingestion_definition:
connection_name: <connection-name>
objects:
- table:
source_schema: <source-schema-1>
source_table: <source-table-1>
destination_catalog: <target-catalog-1> # Location of this table
destination_schema: <target-schema-1> # Location of this table
- table:
source_schema: <source-schema-2>
source_table: <source-table-2>
destination_catalog: <target-catalog-2> # Location of this table
destination_schema: <target-schema-2> # Location of this tableCheck following documentation for details:
Create multi-destination pipelines | Databricks on AWS
If my answer was helpful, please consider marking it as accepted solution.
14 hours ago
You can seamlessly execute the things done via UI in the DABs. You can configure your multi table Lake flow pipelines using YAML configuration if you prefer configuration to ensure reproducibility. More details for Post gre sql ingestion here
You can manage Lakeflow Connect pipelines as code using Asset Bundles for sql server by adding few files like below and use similar approach for other databases
variables:
# Common variables used multiple places in the DAB definition.
gateway_name:
default: sqlserver01-gateway-pipeline
dest_catalog:
default: main
dest_schema:
default: sqlserver01
resources:
pipelines:
gateway:
name: ${var.gateway_name}
gateway_definition:
connection_name: rebel
gateway_storage_catalog: main
gateway_storage_schema: sqlserver01
gateway_storage_name: sqlserver01-gateway-pipeline
catalog: main
target: sqlserver01
pipeline_sqlserver:
name: sqlserver-ingestion-pipeline
ingestion_definition:
ingestion_gateway_id: ${resources.pipelines.gateway.id}
objects:
- schema:
# Ingest all tables in the sqlserver01.dbo schema to main.dest_schema. The destination table name will be drivers, the same as it is on the source.
source_catalog: sqlserver01
source_schema: dbo
destination_catalog: main
destination_schema: sqlserver01
target: sqlserver01
catalog: mainresources:
jobs:
sqlserver_dab_job:
name: sqlserver-ingestion-pipeline job
trigger:
periodic:
interval: 8
unit: HOURS
email_notifications:
on_failure:
- user email
tasks:
- task_key: refresh_pipeline
pipeline_task:
pipeline_id: ${resources.pipelines.pipeline_sqlserver.id}
10 hours ago
Thanks @balajij8 for your support. This YAML approach is working for me.
10 hours ago
Thanks @szymon_dybczak for your support.