<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Error updating tables in DLT in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110184#M43515</link>
    <description>&lt;P&gt;Thank you for your explanation! That makes a lot of sense. Since DLT manages streaming tables and checkpointing, I now understand why I can’t just update an existing table outside the pipeline.&lt;BR /&gt;Given this, what would you recommend in my case?&lt;BR /&gt;Would it be best to modify those records within the same DLT pipeline that initially creates and populates the table? Or is there any workaround that would allow me to load and update the table in a separate DLT pipeline? I appreciate your insights! Thanks again for your help.&lt;/P&gt;</description>
    <pubDate>Fri, 14 Feb 2025 08:49:25 GMT</pubDate>
    <dc:creator>Radix95</dc:creator>
    <dc:date>2025-02-14T08:49:25Z</dc:date>
    <item>
      <title>Error updating tables in DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110121#M43493</link>
      <description>&lt;P&gt;I'm working on a Delta Live Tables (DLT) pipeline in Databricks Serverless mode.&lt;/P&gt;&lt;P&gt;I receive a stream of data from &lt;STRONG&gt;Event Hubs&lt;/STRONG&gt;, where each incoming record contains a unique identifier (uuid) along with some attributes (code1, code2).&lt;/P&gt;&lt;P&gt;My goal is to update an existing table (data_event) using these records: if a record with the same &lt;STRONG&gt;uuid&lt;/STRONG&gt; already exists (matching the incoming record's &lt;STRONG&gt;uuid&lt;/STRONG&gt; against the table record's &lt;STRONG&gt;uuid&lt;/STRONG&gt;), I want to update its values (code1, code2) and fill the column timestamp_update with the current time of the update.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType

schema_update = StructType([
    StructField("uuid", StringType(), False),
    StructField("code1", StringType(), True),
    StructField("code2", StringType(), True)
])

@dlt.view
def raw_data_view():
    df_raw = (
        spark.readStream
             .format("kafka")
             .options(**KAFKA_OPTIONS)
             .load()
    )

    df_parsed = (
        df_raw.selectExpr("CAST(value AS STRING) AS json_data")
              .select(from_json(col("json_data"), schema_update).alias("data"))
              .select("data.*")
              .withColumn("timestamp_update", current_timestamp())
    )
    return df_parsed

dlt.apply_changes(
    target="data_event",
    source="raw_data_view",
    keys=["uuid"],
    sequence_by="timestamp_update",
    stored_as_scd_type=1
)&lt;/LI-CODE&gt;&lt;P&gt;I get this error:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;com.databricks.pipelines.common.errors.DLTSparkException: [STREAMING_TARGET_NOT_DEFINED] Cannot found target table `test`.`testtables`.`data_event` for the APPLY CHANGES command. Target table `test`.`testtables`.`data_event` is not defined in the pipeline.&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 13 Feb 2025 14:15:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110121#M43493</guid>
      <dc:creator>Radix95</dc:creator>
      <dc:date>2025-02-13T14:15:17Z</dc:date>
    </item>
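    <!-- Editor's sketch: the STREAMING_TARGET_NOT_DEFINED error above is typically raised because dlt.apply_changes requires its target table to be declared in the same pipeline. A minimal sketch of the missing declaration, assuming the table name from the error message:

```python
import dlt

# The APPLY CHANGES target must be declared inside this same pipeline.
# The name passed here must match the `target` given to dlt.apply_changes.
dlt.create_streaming_table(
    name="data_event",
    comment="SCD type 1 target kept up to date from the Event Hubs stream"
)
```

    With this declaration in place, the dlt.apply_changes call in the question can resolve its target. -->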
    <item>
      <title>Re: Error updating tables in DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110177#M43510</link>
      <description>&lt;P&gt;All tables that DLT writes to or updates need to be managed by DLT. The reason is that these are streaming tables, so DLT must manage their checkpointing; it also handles optimization for such tables. So in your scenario, you cannot simply update another existing table. It must be defined and updated in the same pipeline.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2025 08:02:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110177#M43510</guid>
      <dc:creator>Edthehead</dc:creator>
      <dc:date>2025-02-14T08:02:20Z</dc:date>
    </item>
    <item>
      <title>Re: Error updating tables in DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110184#M43515</link>
      <description>&lt;P&gt;Thank you for your explanation! That makes a lot of sense, since DLT manages streaming tables and checkpointing, I now understand why I can’t just update an existing table outside the pipeline.&lt;BR /&gt;Given this, what would you recommend in my case?&lt;BR /&gt;Would it be best to modify those records within the same DLT pipeline that initially creates and populates the table? Or is there any workaround that would allow me to load and update the table in a separate DLT pipeline? I appreciate your insights! Thanks again for your help.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2025 08:49:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110184#M43515</guid>
      <dc:creator>Radix95</dc:creator>
      <dc:date>2025-02-14T08:49:25Z</dc:date>
    </item>
    <item>
      <title>Re: Error updating tables in DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110187#M43517</link>
      <description>&lt;P&gt;You should only update the table in the same pipeline (DLT1) that creates it, since that is the pipeline that maintains it. Assuming the table data_event is indeed created and maintained in another DLT pipeline, then based on what you've shared, you should simply move the read of the Kafka source into that pipeline, so that everything lives in the same DLT pipeline. In cases where you must keep separate pipelines, you need to trigger one after the other (in batch mode) or run them continuously (in real-time mode), and DLT1 should create a view with a read stream from the event table and finally update the data_event table. Hope this helps.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2025 09:02:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/error-updating-tables-in-dlt/m-p/110187#M43517</guid>
      <dc:creator>Edthehead</dc:creator>
      <dc:date>2025-02-14T09:02:18Z</dc:date>
    </item>
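    <!-- Editor's sketch: if the pipelines must stay separate as described in the reply above, the pipeline that owns data_event (DLT1) could look roughly like this. The table name `test.testtables.raw_events` is hypothetical; it is assumed to be a streaming table produced by the upstream pipeline:

```python
import dlt
from pyspark.sql.functions import current_timestamp

# Hypothetical upstream table written by the other pipeline (DLT2).
# This pipeline (DLT1) owns data_event and applies the updates itself.
@dlt.view
def updates_view():
    return (
        spark.readStream
             .table("test.testtables.raw_events")
             .withColumn("timestamp_update", current_timestamp())
    )

# Declare the target in this pipeline so APPLY CHANGES can resolve it.
dlt.create_streaming_table(name="data_event")

dlt.apply_changes(
    target="data_event",
    source="updates_view",
    keys=["uuid"],
    sequence_by="timestamp_update",
    stored_as_scd_type=1,
)
```

    In batch mode, DLT2 would be triggered first and DLT1 after it; in continuous mode, both run at the same time and DLT1 picks up new rows as the upstream table grows. -->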
  </channel>
</rss>

