04-27-2026 03:56 PM
Hi @Ashwin_DSA ,
I appreciate the further clarification. Let me make this even clearer.
"the trigger hands your job a parameter payload with the updated table list and the most recent commit version"
This is a nice feature, but we probably cannot use it, especially when the downstream has no idempotent sink and expects an exactly-once delta. Imagine a job that gets the current version, reads the delta, and then fails to write to the sink. So in practice we may not be able to rely on it.
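To make the failure case concrete, here is a minimal sketch in plain Python. It is only a simulation under my own assumptions: `FlakySink`, `run_job`, and the `commits` dictionary are hypothetical names, not Databricks or Delta APIs. The job reads every version after its last successful one up to the version handed over by the trigger, and the sink crashes halfway through the first write.

```python
class FlakySink:
    """Hypothetical non-idempotent sink: every write appends, nothing is deduplicated."""

    def __init__(self, fail_on_call):
        self.rows = []
        self.calls = 0
        self.fail_on_call = fail_on_call

    def write(self, batch):
        self.calls += 1
        if self.calls == self.fail_on_call:
            # Half of the batch lands before the simulated crash.
            self.rows.extend(batch[: len(batch) // 2])
            raise RuntimeError("sink failed mid-write")
        self.rows.extend(batch)


def run_job(commits, last_done, trigger_version, sink):
    """Push versions (last_done, trigger_version] to the sink.

    Returns the new watermark; the caller only stores it if the write succeeded.
    """
    batch = [row for v in range(last_done + 1, trigger_version + 1) for row in commits[v]]
    sink.write(batch)
    return trigger_version


# Hypothetical table history: commit version -> rows added in that commit.
commits = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
sink = FlakySink(fail_on_call=1)
watermark = 0

try:
    watermark = run_job(commits, watermark, 2, sink)  # trigger fires at version 2
except RuntimeError:
    pass  # watermark stays at 0, but "a" and "b" are already in the sink

watermark = run_job(commits, watermark, 3, sink)  # next trigger fires at version 3
print(sink.rows)  # ['a', 'b', 'a', 'b', 'c', 'd', 'e']
```

The retry has to start again from the old watermark, so the rows that landed before the crash are written twice. The version in the trigger payload does not help here; the job still needs its own durable progress marker plus a sink that can deduplicate.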
The slow initialization hints that it tries to go through all commit versions, and the fact that the table update trigger returns a commit version might be one of the reasons. If returning the version or timestamp is not that useful, maybe the table update trigger should not be designed to return it and walk through all versions. At the least, users should be given an option to skip it. If a table is used as a streaming sink, it very likely has many versions, and a new pipeline with a table update trigger cannot get initialized. If I just deploy a new pipeline to monitor a table for updates, it doesn't make sense to go through all versions to decide whether it should be triggered. The new pipeline should either be triggered immediately or only check whether there has been a change since it was created.
We are aware that delta.deletedFileRetentionDuration = 1 hour may break the state. However, we don't do any time travel. All we need is to read the table directly, and we will reset this property after initialization is done.
"Worth noting that you can sidestep it entirely by reading the raw table's commits with Delta change data feed into the staging table as a streaming job, so the staging table is derived from raw rather than written in parallel with it. "
I think Delta tables have a design issue: it is super slow to get the changes between two versions. Delta has checkpoints, so getting a snapshot of any version is fast, but getting the changes between two versions is not. Say I have a raw table loaded from Kafka, and I then want to read it in batches. Whether the batch uses CDF or Structured Streaming (SS) with availableNow, getting the changes is super slow. The slowness has nothing to do with data volume. For 15000 versions with just 15000 rows, it is still super slow, and in SS, even a readStream filtered with where 1=2 still needs 20~40 min to go through each version. For a Delta table, anything that involves many versions is slow, and the initialization failure looks similar. SS is even worse than CDF because it appears to iterate over the changes twice and do schema validation etc., which totally kills availableNow performance.
Because of this, we built Kafka -> Raw -> Staging -> Downstream. Raw to Staging is a streaming job, so we can capture commit_version there. We can then create LC on commit_version and get the delta changes easily. Downstream monitors Staging. However, since Staging is loaded by SS, it still has many versions, and the initialization failure actually comes from here. That is also why we don't need time travel for Staging.
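As a rough illustration of why carrying commit_version as a column helps, here is a small plain-Python sketch. The row layout and the `changes_since` helper are hypothetical; the sorted list only loosely stands in for a Staging table organized by commit_version. The point is that the downstream finds its delta by a range lookup on a data column instead of replaying the table's own version history.

```python
from bisect import bisect_right

# Hypothetical Staging rows, kept ordered by the commit_version column.
staging = [
    {"commit_version": 101, "key": "k1", "value": 1},
    {"commit_version": 101, "key": "k2", "value": 2},
    {"commit_version": 102, "key": "k1", "value": 3},
    {"commit_version": 104, "key": "k3", "value": 4},
]
versions = [row["commit_version"] for row in staging]


def changes_since(last_seen_version):
    """Return rows with commit_version > last_seen_version and the new high-water mark."""
    start = bisect_right(versions, last_seen_version)  # first row past the mark
    batch = staging[start:]
    new_mark = batch[-1]["commit_version"] if batch else last_seen_version
    return batch, new_mark


batch, mark = changes_since(101)
print(len(batch), mark)  # 2 104
```

The downstream only has to persist `mark` after a successful run and pass it back in on the next one.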
BTW, Iceberg should not have this performance issue when reading multiple versions.
Thanks