Not sure what the concrete advantage is for me in creating a streaming table vs. a static one. In my case, I designed a table with a job that extracts the latest files from an S3 location and appends them to a Delta table. I set the job to run continuously. The change feed arrives in S3 approximately every minute, and my job takes about 20 to 30 seconds to process the micro-batched feed. I keep an interactive cluster on all the time to minimize spin-up time.
If I were to switch to AWS SNS and SQS and then use Auto Loader for the same table, what concrete advantage would I gain? SNS and SQS seem to be the standard for streaming file notifications from AWS, but wouldn't my process be sufficient? On every refresh (twice a minute) I use the partitioning of the S3 folders (year/month/day/timestamp.parquet) to extract only the latest files, then spark.read.parquet() and bingo: the batch is quickly processed without any issues. I checkpoint every run so I only pick up files that arrived since the last refresh.
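For context, the incremental file-selection step looks roughly like this (a simplified sketch, not my exact code; the key layout and epoch-timestamp file names are placeholder conventions):

```python
from datetime import datetime, timezone

def parse_file_timestamp(key: str) -> datetime:
    """Recover the event time from a year/month/day/timestamp.parquet key.

    Assumes keys like 'feed/2024/06/01/1717200000.parquet' where the file
    name is a Unix epoch timestamp (placeholder convention for this sketch).
    """
    name = key.rsplit("/", 1)[-1]            # '1717200000.parquet'
    epoch = int(name.split(".", 1)[0])       # 1717200000
    return datetime.fromtimestamp(epoch, tz=timezone.utc)

def files_since_checkpoint(keys: list[str], checkpoint: datetime) -> list[str]:
    """Keep only files that landed strictly after the last processed time."""
    return [k for k in keys if parse_file_timestamp(k) > checkpoint]
```

The surviving keys are then read with spark.read.parquet() and appended to the Delta table, and the checkpoint is advanced to the latest timestamp processed.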
I also manage schema evolution with my own internal code, and I handle potentially late-arriving data via a record-time column. It all works very well. So what is the advantage, and why should I drop my micro-batching process with a continuous-trigger workflow in favor of readStream? Or are we essentially doing the same thing?