Why should I move to Auto-loader?
06-23-2021 03:54 PM
I have a streaming workload that uses the S3-SQS connector. The streaming job runs fine and stays within its SLA. Should I migrate the job to Auto Loader? If yes, what are the benefits? Who should migrate and who should not?
06-23-2021 04:03 PM
The primary benefit of Auto Loader is that it abstracts away the checkpointing of which data has already been processed successfully and which still needs processing. For your use case, the file notification mode would work well: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#file-discovery-modes. Auto Loader also offers schema inference and schema evolution, which may help depending on your use case: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#configuration
Finally, moving to Auto Loader future-proofs your pipelines, since you pick up further enhancements to Auto Loader as they are released.
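For illustration, here is a minimal sketch of what an Auto Loader stream in file notification mode might look like. The helper names (`auto_loader_options`, `start_auto_loader_stream`) and all paths are hypothetical; the `cloudFiles.*` option names are the ones I know from the Auto Loader docs, but check them against your Databricks Runtime version. It assumes a Databricks `spark` session and a Delta sink.

```python
from typing import Dict, Optional


def auto_loader_options(
    file_format: str,
    schema_location: str,
    queue_url: Optional[str] = None,
) -> Dict[str, str]:
    """Build reader options for Auto Loader in file notification mode.

    Hypothetical helper, not part of any Databricks API.
    """
    options = {
        # Format of the incoming files, e.g. "json", "csv", "parquet".
        "cloudFiles.format": file_format,
        # Use file notifications instead of directory listing.
        "cloudFiles.useNotifications": "true",
        # Where Auto Loader keeps the inferred schema across restarts.
        "cloudFiles.schemaLocation": schema_location,
    }
    if queue_url is not None:
        # Assumption: reuse an existing SQS queue rather than letting
        # Auto Loader set up its own notification resources.
        options["cloudFiles.queueUrl"] = queue_url
    return options


def start_auto_loader_stream(spark, source_path, target_path,
                             checkpoint_path, options):
    """Start a stream from source_path into a Delta table at target_path.

    `spark` is an existing SparkSession on a Databricks cluster.
    """
    df = (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    return (
        df.writeStream
        .format("delta")
        # Progress tracking lives under the checkpoint location.
        .option("checkpointLocation", checkpoint_path)
        .start(target_path)
    )
```

On a cluster you would call something like `start_auto_loader_stream(spark, "s3://my-bucket/input/", "s3://my-bucket/bronze/", "s3://my-bucket/_checkpoints/bronze/", auto_loader_options("json", "s3://my-bucket/_schemas/bronze/"))`, with the bucket paths replaced by your own.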
06-23-2021 10:26 PM
That makes sense, @Anand Ladda!
One major improvement with a direct impact on performance is the architectural difference. S3-SQS uses an internal Delta table to store checkpoint details about the source files. From customer workloads we have seen that using a Delta table to checkpoint source file details is not efficient; what is needed is a store that offers fast retrieval and insertion. Auto Loader therefore uses RocksDB for this checkpointing, which directly improves the performance of streaming queries.
Some of the issues seen with S3-SQS that Auto Loader addresses:
- Latency in starting the streaming query
- The streaming query pausing for a long time every hour
- Synchronous fetching and deleting causing issues

