Streaming vs Batch with Continuous Trigger
I'm not sure what the concrete advantage is for me in creating a streaming table vs. a static one. In my case, I designed a table with a job that extracts the latest files from an S3 location and then appends them to a Delta table. I set the job to run continuously. The change feed arrives approximately every minute in S3, and my job takes about 20 to 30 seconds to process the micro-batched feed. I keep an interactive cluster on all the time to minimize spin-up time.
If I were to switch to AWS SNS and SQS and then use Autoloader for the same table, what concrete advantage do I gain? SNS and SQS seem to be the standard for streaming file notifications from AWS, but wouldn't my process be sufficient? On every refresh (twice a minute) I use the partitioning of the S3 folders (year/month/day/timestamp.parquet) to extract only the latest files, then spark.read.parquet() and bingo: the batch is quickly processed without any issues. I checkpoint every run so that I only filter for files arriving since the last refresh.
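The incremental file-filtering step described above can be sketched roughly like this. The key layout and function name are hypothetical illustrations of the year/month/day/timestamp.parquet scheme, not the poster's actual code:

```python
def new_files_since(keys: list[str], last_ts: int) -> list[str]:
    """Return keys whose embedded timestamp is newer than the checkpointed one.

    Assumes keys laid out as 'feed/<year>/<month>/<day>/<timestamp>.parquet'
    (a hypothetical layout matching the partitioning described above).
    """
    fresh = []
    for key in keys:
        stem = key.rsplit("/", 1)[-1].removesuffix(".parquet")
        if stem.isdigit() and int(stem) > last_ts:
            fresh.append(key)
    return sorted(fresh)

keys = [
    "feed/2024/05/17/1715900000.parquet",
    "feed/2024/05/17/1715900060.parquet",
    "feed/2024/05/17/1715900120.parquet",
]
print(new_files_since(keys, 1715900000))
# The selected paths would then be read with spark.read.parquet(*fresh),
# and the newest timestamp persisted as the next checkpoint.
```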
I also manage schema evolution with my own internal code, and I handle potential late-arriving data via a record-time column. It all works very well. So what is the advantage, and why should I not keep my micro-batching process with a continuous-trigger workflow in favor of readStream? Or are we essentially doing the same thing?
You're essentially implementing a well-optimized micro-batching process, and functionally, it's very similar to what readStream() with Autoloader would do. However, there are some advantages to using Autoloader and a proper streaming table that might be worth considering.
Concrete Advantages of Streaming Tables & Autoloader in Your Case
Scalability & Efficiency
Your current approach works well because you control partitioning and file listing manually. But as the number of files grows, listing-based discovery with spark.read.parquet() can degrade in performance, since enumerating S3 objects gets slower and more expensive as the bucket fills up.
Autoloader's file notification mode (backed by AWS SNS/SQS) eliminates the need for explicit listing entirely, reducing metadata operations.
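As a rough sketch of what file-notification mode looks like (the bucket, region, and paths below are placeholder assumptions, not values from this thread):

```python
# Auto Loader options for file-notification mode; all values below are
# placeholder assumptions for illustration.
cloudfiles_opts = {
    "cloudFiles.format": "parquet",
    "cloudFiles.useNotifications": "true",  # discover files via SNS/SQS events
    "cloudFiles.region": "us-east-1",
}

# Inside Databricks (where `spark` is provided), the stream would be declared as:
# df = (spark.readStream
#         .format("cloudFiles")
#         .options(**cloudfiles_opts)
#         .load("s3://my-bucket/feed/"))
```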
No Need for Explicit Checkpointing & File Filtering
Right now, you're manually tracking the last processed file via checkpoints.
With readStream(), Autoloader automatically tracks processed files and ensures no duplication without requiring explicit filtering logic.
True Streaming vs. Continuous Micro-Batching
Even though your workflow runs every ~30 seconds, there's still a small gap where data is waiting to be processed.
readStream() running with a short processing-time trigger (or the default micro-batch trigger) can reduce end-to-end latency, since Spark picks up new files as soon as they are discovered instead of waiting for the next scheduled batch. (Note that Trigger.Once and Trigger.AvailableNow are the opposite: batch-style triggers meant for catch-up runs, not lower latency.)
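A minimal sketch of the write side, assuming placeholder paths and table names: the checkpointLocation here replaces the hand-rolled checkpoint, and the trigger interval controls how often new files are picked up.

```python
# Placeholder checkpoint path and target table; not from the original post.
stream_opts = {
    "checkpointLocation": "s3://my-bucket/_checkpoints/change_feed",
}

# In a Databricks notebook:
# (df.writeStream
#    .options(**stream_opts)
#    .trigger(processingTime="10 seconds")  # short micro-batches for low latency
#    .toTable("bronze.change_feed"))
```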
Automatic Schema Evolution
You mentioned handling schema evolution manually. Autoloader can simplify this with its cloudFiles.schemaEvolutionMode option (plus mergeSchema = true on the Delta write), so new columns are tracked and added without custom code.
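A sketch of the relevant options (the schema-tracking location is a hypothetical path): Auto Loader stores inferred schema versions at the schemaLocation and can evolve them as new columns appear in the incoming files.

```python
# Hypothetical schema-tracking location for illustration only.
schema_opts = {
    "cloudFiles.schemaLocation": "s3://my-bucket/_schemas/change_feed",
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
}
# These are passed alongside the other cloudFiles options on readStream;
# the Delta sink side would add .option("mergeSchema", "true").
```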
Easier Integration with Delta Change Data Feed (CDF)
If your use case evolves and you need CDF, streaming tables integrate more naturally with readStream() and writeStream().
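If CDF did become relevant, consuming it as a stream is a short declaration (the table name is a placeholder, and CDF must already be enabled on the table):

```python
# Placeholder table name; readChangeFeed requires delta.enableChangeDataFeed
# to be set on the source table.
cdf_read_opts = {
    "readChangeFeed": "true",
    "startingVersion": "0",
}
# cdf = (spark.readStream
#          .format("delta")
#          .options(**cdf_read_opts)
#          .table("bronze.change_feed"))
```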
Cost Optimization with Serverless Compute (Future Consideration)
Since you're keeping an interactive cluster running at all times, this may be expensive. With a streaming table, you could potentially move to a jobs cluster or Databricks serverless compute instead of an always-on interactive cluster, reducing costs.
Should You Switch?
If your current method is working well, there's no immediate need to switch. However, if you're expecting higher data volumes, schema changes, or lower-latency requirements, then using Autoloader and readStream() would provide more efficiency and automation.

