Re: Handling single table in multiple dlt pipelines

VZLA · 12-16-2024

Yes, your understanding is partially correct. Let me clarify:

Only one pipeline can own and manage a target table, including operations such as schema evolution, maintenance, and refreshes.
When other pipelines are described as "producing upstream data", it means they generate or prepare data that the owning pipeline then consumes. These other pipelines do not append to the target table directly; they write to intermediate or staging locations instead.

Now, addressing your specific question: Yes, this upstream data can be appended back to the same target table, but the append operation must happen through the owning pipeline. Other pipelines act as feeders.

Example:

Pipeline A (Owning the Table): Handles schema evolution, maintenance, and appends data from both its source and the staging table.

import dlt

@dlt.table(
    name="target_table",
    comment="This table is owned by Pipeline A."
)
def target_table():
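    # Read Pipeline A's own source as a stream
    main_source = dlt.read_stream("source_stream")

    # staging_table is published by Pipeline B, so read it by table name;
    # dlt.read() only resolves datasets defined in this same pipeline.
    # Use the fully qualified table name in practice.
    # It is read as a stream because Spark cannot union a streaming
    # DataFrame with a batch one.
    staging_data = spark.readStream.table("staging_table")

    # Match columns by name; columns missing on one side are filled with nulls
    return main_source.unionByName(staging_data, allowMissingColumns=True)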
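
To make the last line of Pipeline A concrete, here is a plain-Python sketch of what unionByName(..., allowMissingColumns=True) does when the two inputs have different schemas. It is a toy model over lists of dicts, not Spark code; the union_by_name helper and the sample rows (id, amount, region) are hypothetical and only illustrate the column-matching behaviour.

```python
from typing import Any

Row = dict[str, Any]


def union_by_name(left: list[Row], right: list[Row]) -> list[Row]:
    """Toy model of unionByName(..., allowMissingColumns=True) on dict rows."""
    # Result columns: the left side's columns first, then any columns
    # that appear only on the right side.
    columns: list[str] = []
    for row in left + right:
        for name in row:
            if name not in columns:
                columns.append(name)

    # Rows are aligned by column name; a column absent from a row
    # becomes None (a null in Spark).
    return [{name: row.get(name) for name in columns} for row in left + right]


# Hypothetical sample data: the staging rows lack "amount" and add "region".
main_source = [{"id": 1, "amount": 10.0}]
staging_data = [{"id": 2, "region": "EU"}]

combined = union_by_name(main_source, staging_data)
for row in combined:
    print(row)
```

In the real pipeline this means Pipeline B can add a column to the staging table without breaking the union; rows from the other source simply carry a null in that column.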

Pipeline B (Generating Upstream Data): Writes data to a staging table that Pipeline A reads.

import dlt

@dlt.table(
    name="staging_table",
    comment="Intermediate data for Pipeline A."
)
def staging_table():
    source_data = spark.readStream.format("delta").load("path_to_source")
    return source_data

Workflow Summary

Pipeline A owns and maintains target_table. It consolidates data from its primary source and from staging_table.
Pipeline B processes its data and writes to staging_table. It never writes to target_table directly, which avoids ownership conflicts.