cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Trigger Events in data pipeline

sanjay
Valued Contributor II

Hi,

I am running datapipeline in databrick using matillion architecture. I am facing inconsistent events in silver to gold layer in case any row deleted/updated from a partition. Let me explain with example.

e.g. I have data in silver layer with partition on department id & joining date. If lets assume there are 3 employees joined in dept 1 and joining date as 01 Oct 2023. So this data is available in silver layer. Now, If am updating one of the employee record, then events are generated for all the data in that partition in silver to gold layer i.e. am getting all 3 records as change even if updates are done on single record.

Here is my code

(spark.readStream.format("delta")
.option("useNotification","true")
.option("includeExistingFiles","true")
.option("allowOverwrites",True)
.option("ignoreMissingFiles",True)
.option("ignoreChanges","true")
.option("maxFilesPerTrigger", 100)
.load(silver_path)
.writeStream
.queryName("SilverGoldStream")
.option("checkpointLocation", gold_checkpoint_path)
.trigger(once=True)
.foreachBatch(foreachBatchFunction)
.start()
.awaitTermination()
)

Appreciate any help here.

Regards,

Sanjay

2 ACCEPTED SOLUTIONS

Accepted Solutions

Kaniz
Community Manager
Community Manager

Hi @sanjayThe behaviour you're experiencing is due to the way Delta Lake handles updates. When you update a record in a partition, Delta Lake treats this as a delete followed by an insert. This is because Delta Lake is built on top of Parquet, which is an immutable columnar format. As a result, when you update a record, Delta Lake needs to rewrite the entire partition. This is why you're seeing events for all the records in the partition, even though you're only updating one record. 

To resolve this issue, you can consider a few reasons:

1. **Minimize updates**: Try to structure your pipeline to minimise the number of updates. This could involve, for example, appending new data instead of updating existing records 

2. **Repartition your data**: If you're frequently updating records within the same partition, you might want to consider repartitioning your data. By choosing a different column for partitioning, you can reduce the amount of data that needs to be rewritten when an update occurs.

3. **Optimize your Delta Lake table**: Delta Lake provides an OPTIMIZE command that you can use to compact small files into larger ones, which can help improve the performance of your queries and reduce the impact of updates.

View solution in original post

Kaniz
Community Manager
Community Manager

Hi @sanjay, If you have nested partitions on department and date, such as finance->09 and finance->10, and you update one record in finance->09, it will not update the partition finance->10. Each partition is independent of the other, and updating one partition will not affect the others.

 

Regarding your second question, having smaller partitions can reduce the impact of updates. However, it is important to note that having too many partitions can also lead to performance issues. The maximum number of partitions you can have depends on the database management system you are using and the hardware resources available. It is best to consult the documentation of your database management system to determine the optimal ....

View solution in original post

3 REPLIES 3

Kaniz
Community Manager
Community Manager

Hi @sanjayThe behaviour you're experiencing is due to the way Delta Lake handles updates. When you update a record in a partition, Delta Lake treats this as a delete followed by an insert. This is because Delta Lake is built on top of Parquet, which is an immutable columnar format. As a result, when you update a record, Delta Lake needs to rewrite the entire partition. This is why you're seeing events for all the records in the partition, even though you're only updating one record. 

To resolve this issue, you can consider a few reasons:

1. **Minimize updates**: Try to structure your pipeline to minimise the number of updates. This could involve, for example, appending new data instead of updating existing records 

2. **Repartition your data**: If you're frequently updating records within the same partition, you might want to consider repartitioning your data. By choosing a different column for partitioning, you can reduce the amount of data that needs to be rewritten when an update occurs.

3. **Optimize your Delta Lake table**: Delta Lake provides an OPTIMIZE command that you can use to compact small files into larger ones, which can help improve the performance of your queries and reduce the impact of updates.

sanjay
Valued Contributor II

Thank you Kaniz. 

Further queries on this.

1. If I have nested partitions e.g. on department & date, finance->09, finance->10 and if am updating one record in finance->09 then will then updates partition finance->10 as well

2. Is it good idea to have smaller partition to reduce impact of updates. What's maximum number of partitions I can have

Thanks,

Sanjay

Kaniz
Community Manager
Community Manager

Hi @sanjay, If you have nested partitions on department and date, such as finance->09 and finance->10, and you update one record in finance->09, it will not update the partition finance->10. Each partition is independent of the other, and updating one partition will not affect the others.

 

Regarding your second question, having smaller partitions can reduce the impact of updates. However, it is important to note that having too many partitions can also lead to performance issues. The maximum number of partitions you can have depends on the database management system you are using and the hardware resources available. It is best to consult the documentation of your database management system to determine the optimal ....

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.