Data Engineering

Error updating tables in DLT

Radix95
New Contributor II

I'm working on a Delta Live Tables (DLT) pipeline in Databricks Serverless mode.

I receive a stream of data from Event Hubs, where each incoming record contains a unique identifier (uuid) along with some attributes (code1, code2).

My goal is to update an existing table (data_event) with these records: when an incoming record's uuid matches the uuid of a row already in the table, I want to update that row's code1 and code2 values and set the timestamp_update column to the time of the update.

This is my pipeline code:

import dlt
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the incoming Event Hubs records
schema_update = StructType([
    StructField("uuid", StringType(), False),
    StructField("code1", StringType(), True),
    StructField("code2", StringType(), True)
])

@dlt.view
def raw_data_view():
    # Read the Event Hubs stream through the Kafka connector
    df_raw = (
        spark.readStream
             .format("kafka")
             .options(**KAFKA_OPTIONS)
             .load()
    )

    # Parse the JSON payload and stamp each record with the update time
    df_parsed = (
        df_raw.selectExpr("CAST(value AS STRING) AS json_data")
              .select(from_json(col("json_data"), schema_update).alias("data"))
              .select("data.*")
              .withColumn("timestamp_update", current_timestamp())
    )
    return df_parsed

# Upsert into data_event on uuid, keeping the latest record per key (SCD type 1)
dlt.apply_changes(
    target="data_event",
    source="raw_data_view",
    keys=["uuid"],
    sequence_by="timestamp_update",
    stored_as_scd_type=1
)
When I run the pipeline, I get this error:

com.databricks.pipelines.common.errors.DLTSparkException: [STREAMING_TARGET_NOT_DEFINED] Cannot found target table `test`.`testtables`.`data_event` for the APPLY CHANGES command. Target table `test`.`testtables`.`data_event` is not defined in the pipeline.

 

 



1 ACCEPTED SOLUTION

You should only update the table in the same pipeline (call it DLT1) that creates it, since that is the pipeline that maintains it. Assuming data_event is indeed created and maintained in another DLT pipeline, then based on what you've shared, you should simply move the read of the Kafka source into that pipeline, so that everything lives in the same DLT pipeline. In cases where you must have separate pipelines, you need to trigger one after the other (in batch mode) or run them continuously (in real-time mode), and DLT1 should create a view with a read stream from the event table and then update the data_event table. Hope this helps.
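For illustration, here is a minimal sketch of what the consolidated pipeline could look like, reusing the schema, KAFKA_OPTIONS, and table name from the question; treat it as one possible arrangement rather than a drop-in solution. The key addition is the dlt.create_streaming_table declaration, which defines the APPLY CHANGES target inside the same pipeline.

import dlt
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType

schema_update = StructType([
    StructField("uuid", StringType(), False),
    StructField("code1", StringType(), True),
    StructField("code2", StringType(), True)
])

@dlt.view
def raw_data_view():
    # Event Hubs exposed through the Kafka connector; KAFKA_OPTIONS as in the question
    raw = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()
    return (
        raw.selectExpr("CAST(value AS STRING) AS json_data")
           .select(from_json(col("json_data"), schema_update).alias("data"))
           .select("data.*")
           .withColumn("timestamp_update", current_timestamp())
    )

# Declare the target table in the SAME pipeline so APPLY CHANGES can find it
dlt.create_streaming_table(name="data_event")

# Upsert on uuid (SCD type 1), using timestamp_update to order changes
dlt.apply_changes(
    target="data_event",
    source="raw_data_view",
    keys=["uuid"],
    sequence_by="timestamp_update",
    stored_as_scd_type=1
)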


3 REPLIES

Edthehead
Contributor III

All the tables that DLT writes to or updates need to be managed by DLT. The reason is that these tables are streaming tables, so DLT has to manage their checkpointing; it also handles optimization for such tables. So in your scenario, you cannot simply update an arbitrary existing table: the target must be defined and updated in the same pipeline.
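To make that concrete, the piece missing from the pipeline in the question is a declaration of the target table inside the same pipeline. A minimal sketch, assuming the table name from the question:

import dlt

# Declaring the APPLY CHANGES target inside the same pipeline; without a
# declaration like this, DLT raises STREAMING_TARGET_NOT_DEFINED for data_event.
dlt.create_streaming_table(
    name="data_event",
    comment="Upsert target maintained by apply_changes in this pipeline"
)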

Radix95
New Contributor II

Thank you for your explanation! That makes a lot of sense. Since DLT manages streaming tables and checkpointing, I now understand why I can't just update an existing table outside the pipeline.
Given this, what would you recommend in my case?
Would it be best to modify those records within the same DLT pipeline that initially creates and populates the table? Or is there any workaround that would allow me to load and update the table in a separate DLT pipeline? I appreciate your insights! Thanks again for your help.
