Hi,
I am running a data pipeline in Databricks using the Medallion architecture. I am seeing inconsistent change events in the silver-to-gold layer whenever a row is deleted or updated within a partition. Let me explain with an example.
For example, I have data in the silver layer partitioned on department ID and joining date. Assume 3 employees joined dept 1 with a joining date of 01 Oct 2023, so those 3 rows exist in the silver layer. Now, if I update one of those employee records, change events are generated for all rows in that partition in the silver-to-gold stream, i.e. I receive all 3 records as changed even though only a single record was updated.
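For context, the update that triggers this is a plain single-row Delta update along these lines (the table and column names below are illustrative placeholders, not my exact schema):

```python
# Illustrative only: a single-row update against the partitioned silver table.
# Table name and columns are placeholders for my actual schema.
spark.sql("""
    UPDATE silver.employees
    SET salary = 75000
    WHERE emp_id = 101
      AND dept_id = 1
      AND joining_date = '2023-10-01'
""")
```

Even though this statement touches one row, the downstream stream delivers every row of that partition to the gold layer.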
Here is my code:
(spark.readStream.format("delta")
    .option("useNotification", "true")
    .option("includeExistingFiles", "true")
    .option("allowOverwrites", "true")
    .option("ignoreMissingFiles", "true")
    .option("ignoreChanges", "true")
    .option("maxFilesPerTrigger", 100)
    .load(silver_path)
    .writeStream
    .queryName("SilverGoldStream")
    .option("checkpointLocation", gold_checkpoint_path)
    .trigger(once=True)
    .foreachBatch(foreachBatchFunction)
    .start()
    .awaitTermination()
)
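From what I have read, ignoreChanges re-emits every row of any rewritten data file, so a single-row UPDATE that rewrites a partition's file would explain getting all 3 records back. Would switching to Delta Change Data Feed be the right fix? A sketch of what I am considering, assuming CDF is enabled on the silver table (delta.enableChangeDataFeed = true):

```python
# Sketch: read row-level changes via Delta Change Data Feed instead of
# whole rewritten files. Assumes CDF is enabled on the silver table.
(spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .load(silver_path)
    .writeStream
    .queryName("SilverGoldCDFStream")
    .option("checkpointLocation", gold_checkpoint_path)
    .trigger(once=True)
    .foreachBatch(foreachBatchFunction)
    .start()
    .awaitTermination()
)
```

My understanding is that with CDF each microbatch carries _change_type and _commit_version columns, so foreachBatchFunction could filter down to only the rows that actually changed. Is that the recommended approach here?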
Appreciate any help here.
Regards,
Sanjay