bakselrud
New Contributor III

Ok, so after doing some investigation on the way to resolving my original question, I think we're getting some clarity after all.

Consider the following data frame that is ingested by DLT streaming pipeline:

dfMock = spark.sparkContext.parallelize([[1,0,2],[1,1,3]]). \

toDF(StructType([StructField("key", LongType(), False), StructField("seq", LongType(), False), StructField("data", LongType(), False)]))

Ingesting this for the very first time, using DLT type-1 updates, results in a successful run. Ingesting this second time fails.

Why?

Because there is no checkpointing associated with the input data. The moment a streaming pipeline is stopped and run again, it appears to re-process the same data and naturally fails, because it thinks that it sees an updates to the same keys.

What was that the designers of DLT pipeline were thinking regarding this originally? We're left to wonder.

Now, the answer at this point from Databricks should be this: "But of course, this is because you misunderstood how data processing using the streaming data sources work. In streaming data sources you have checkpoints that take care of the processed vs unprocessed records. Do your homework"

We did the homework and evaluated streaming read / streaming DLT pipeline. However what we found is that even in that case, the DLT pipeline fails as it does not recognize that during its stopping and re-starting it reads par of the input stream that it already seen in the previous run. This is the root of the problem.

Now, since I break out some of the issues to you, could you help me understand how to properly use DLT pipeline so it does not break on re-start? Is it not supposed to pick up from the checkpoint of the input stream that it has failed (or halted over) berfore?