04-23-2026 04:34 PM
Hi @Ashwin_DSA ,
Thanks for the reply. Our scenario: a Kafka source streams data into a raw table, and downstream tasks consume the raw data. Data arrives in the raw table sparsely, at intervals of minutes. If we run a streaming job for the downstream tasks, EC2 alone could cost thousands of dollars a year even on a single-node cluster, plus the DBUs. Because the source is sparse, an always-on streaming job is a waste here.
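To make the cost concern concrete, here is a rough back-of-the-envelope sketch. The hourly rate, run count, and run duration are placeholders I made up for illustration, not real EC2 or DBU prices; plug in your own numbers.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_always_on_cost(hourly_rate: float) -> float:
    """Cost of a cluster that runs all year, e.g. for an always-on streaming job."""
    return HOURS_PER_YEAR * hourly_rate


def annual_triggered_cost(runs_per_day: float, minutes_per_run: float,
                          hourly_rate: float) -> float:
    """Cost when compute is billed only while triggered runs are executing."""
    hours_per_day = runs_per_day * minutes_per_run / 60
    return hours_per_day * 365 * hourly_rate


if __name__ == "__main__":
    rate = 0.25  # hypothetical all-in hourly rate in USD, NOT a real price
    print("always on:", annual_always_on_cost(rate))
    print("triggered:", annual_triggered_cost(runs_per_day=24, minutes_per_run=5,
                                              hourly_rate=rate))
```

The sketch assumes the same hourly rate in both cases, which will not hold exactly when comparing EC2 plus DBUs against serverless pricing. The point is only that sparse arrivals make the billed hours much smaller than 8760.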
We want to run the downstream tasks at minute-level latency. We cannot use an all-purpose cluster because of the cold start, so we are thinking of using serverless compute with a table update trigger. Is our understanding below correct?
1. With a table update trigger on the raw table, if data keeps arriving, there is no real benefit of table update trigger + serverless over a minute-level cron schedule + serverless, right?
2. For a table update trigger or file arrival trigger without file events, Databricks is polling behind the scenes. For a table update trigger or file arrival trigger with file events, is it actually a push model?
3. We didn't find specific latency numbers for triggers with vs. without file events. What latency figures can we give our stakeholders for each trigger type?
4. The polling, or the notifications received from file events, is free. Our cost is only the resources the triggered runs use.
5. We stream Kafka data to a raw table, and the downstream job monitors the raw table. We noticed that the first trigger evaluation needs a long initialization if the raw table has many commit versions (as it is the sink for Kafka). However, sometimes we get `The table 'xxx' has exceeded the maximum number of initial evaluations (20). This can occur if the table's delta log directory is large and contains many versions. Consider running VACUUM on the table to clean up old versions.` So it looks like the trigger initialization checks early commit versions. If so, why? Shouldn't a new pipeline just run immediately without checking earlier logs?
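
For reference, the two options compared in point 1 look roughly like this as job settings. This is a sketch based on my reading of the Jobs API trigger and schedule fields, so please correct me if a field is wrong. The table name and the 60-second values are placeholders, and the helper function names are mine.

```python
import json

RAW_TABLE = "main.raw.kafka_events"  # placeholder three-level table name


def table_update_trigger(table_name: str, min_gap_s: int = 60,
                         settle_s: int = 60) -> dict:
    """Job settings fragment: run when the monitored table is updated."""
    return {
        "trigger": {
            "pause_status": "UNPAUSED",
            "table_update": {
                "table_names": [table_name],
                "condition": "ANY_UPDATED",
                # Lower bound on the time between two triggered runs (placeholder).
                "min_time_between_triggers_seconds": min_gap_s,
                # Wait for the table to be quiet this long before firing (placeholder).
                "wait_after_last_change_seconds": settle_s,
            },
        }
    }


def cron_schedule(every_minutes: int = 5) -> dict:
    """Job settings fragment: run on a fixed Quartz cron schedule."""
    return {
        "schedule": {
            # Quartz fields: seconds minutes hours day-of-month month day-of-week
            "quartz_cron_expression": f"0 0/{every_minutes} * * * ?",
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED",
        }
    }


if __name__ == "__main__":
    print(json.dumps(table_update_trigger(RAW_TABLE), indent=2))
    print(json.dumps(cron_schedule(5), indent=2))
```

If data lands in the raw table more often than `min_time_between_triggers_seconds`, the trigger variant would fire at about that interval, which is why it looks equivalent to the cron variant to us in the "data keeps coming" case.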