Databricks Community

brickster_2018 · ‎06-23-2021

I have a streaming workload using the S3-SQS connector. The streaming job is running fine within the SLA. Should I migrate my job to use Auto Loader? If yes, what are the benefits? Who should migrate and who should not?

aladda · ‎06-23-2021

The primary benefit of Auto Loader is that it abstracts away checkpointing: it keeps track of which data has already been processed successfully and which still needs processing. For your use case, the file notification mode would work well - https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#file-discovery-modes. Auto Loader also offers schema inference and evolution, which may help depending on your use case - https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#configuration

Finally, moving to Auto Loader future-proofs your pipelines, since you will benefit from ongoing enhancements to it.
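
As a rough illustration of the file notification mode and schema tracking mentioned above, here is a minimal Python sketch. It assumes a Databricks runtime where the `cloudFiles` source is available. The helper names `auto_loader_options` and `build_stream`, the bucket paths, and the queue URL are hypothetical. The option keys (`cloudFiles.format`, `cloudFiles.useNotifications`, `cloudFiles.schemaLocation`, `cloudFiles.queueUrl`) are Auto Loader options to the best of my knowledge; please check them against the docs for your runtime version.

```python
from typing import Dict, Optional


def auto_loader_options(
    file_format: str,
    schema_location: str,
    queue_url: Optional[str] = None,
) -> Dict[str, str]:
    """Build reader options for Auto Loader in file notification mode.

    Hypothetical helper, not part of any Databricks API.
    """
    options = {
        # Format of the incoming files, e.g. "json", "csv", "parquet".
        "cloudFiles.format": file_format,
        # Use file notifications instead of directory listing.
        "cloudFiles.useNotifications": "true",
        # Where Auto Loader stores the inferred schema and its changes.
        "cloudFiles.schemaLocation": schema_location,
    }
    if queue_url is not None:
        # Assumption: point Auto Loader at an already existing SQS queue
        # rather than letting it set up its own notification resources.
        options["cloudFiles.queueUrl"] = queue_url
    return options


def build_stream(spark, source_path: str, options: Dict[str, str]):
    """Create the streaming DataFrame; requires a Databricks SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
```

On a cluster, usage would look something like `build_stream(spark, "s3://example-bucket/events/", auto_loader_options("json", "s3://example-bucket/_schemas/events"))`, followed by a `writeStream` with its own `checkpointLocation`. The paths here are placeholders.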

brickster_2018 · ‎06-23-2021

That makes sense, @Anand Ladda!

One major improvement with a direct impact on performance is the architectural difference. S3-SQS uses an internal Delta table to store checkpoint details about the source files. From what we have seen with customers, using a Delta table to checkpoint source file details is not efficient; a store that offers faster retrieval and insertion is needed. Auto Loader therefore checkpoints using RocksDB, which directly improves the performance of streaming queries.

Some of the issues faced with S3-SQS and addressed in Auto Loader are listed below: