I'm working on a Delta Live Tables (DLT) pipeline in Databricks Serverless mode.
I receive a stream of data from Event Hubs, where each incoming record contains a unique identifier (uuid) along with some attributes (code1, code2).
My goal is to upsert these records into an existing table (data_event): when a row with the same uuid already exists (matching the incoming record's uuid against the table's uuid), I want to update its code1 and code2 values and set the timestamp_update column to the time the update is applied.
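For illustration, a single message value looks roughly like this (made-up values, fields matching the schema below):

{"uuid": "0f8fad5b-d9cb-469f-a165-70867728950e", "code1": "A1", "code2": "B2"}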
import dlt
from pyspark.sql.functions import col, current_timestamp, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the parsed Event Hubs payload
schema_update = StructType([
    StructField("uuid", StringType(), False),
    StructField("code1", StringType(), True),
    StructField("code2", StringType(), True)
])
@dlt.view
def raw_data_view():
    # Read the raw stream from Event Hubs through the Kafka connector
    df_raw = (
        spark.readStream
        .format("kafka")
        .options(**KAFKA_OPTIONS)
        .load()
    )
    # Parse the JSON payload and stamp each record with the processing time
    df_parsed = (
        df_raw.selectExpr("CAST(value AS STRING) AS json_data")
        .select(from_json(col("json_data"), schema_update).alias("data"))
        .select("data.*")
        .withColumn("timestamp_update", current_timestamp())
    )
    return df_parsed
# Upsert into data_event keyed on uuid, keeping only the latest version (SCD type 1)
dlt.apply_changes(
    target="data_event",
    source="raw_data_view",
    keys=["uuid"],
    sequence_by="timestamp_update",
    stored_as_scd_type=1
)
I get this error:
com.databricks.pipelines.common.errors.DLTSparkException: [STREAMING_TARGET_NOT_DEFINED] Cannot found target table `test`.`testtables`.`data_event` for the APPLY CHANGES command. Target table `test`.`testtables`.`data_event` is not defined in the pipeline.
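Reading the error, it sounds like the target table has to be declared inside the same pipeline before apply_changes can write into it. Is the missing piece something like the following, placed before the apply_changes call? This is just a minimal sketch based on my reading of the DLT docs, and the comment text is illustrative:

# Declare the target streaming table in the same pipeline so APPLY CHANGES can find it
dlt.create_streaming_table(
    name="data_event",
    comment="Target table upserted from Event Hubs records"
)

Or is there a different way this is supposed to be set up when the pipeline runs in Serverless mode?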