
Why should I move to Auto-loader?

brickster_2018
Esteemed Contributor

I have a streaming workload that uses the S3-SQS connector. The streaming job runs fine within its SLA. Should I migrate the job to Auto Loader? If yes, what are the benefits? Who should migrate and who should not?

2 REPLIES

aladda
Honored Contributor II

The primary benefit of Auto Loader is that it abstracts away the checkpointing of which data has already been processed successfully and which still needs processing. For your use case, the file notification mode would work well: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#file-discovery-modes. Auto Loader also offers schema inference and schema evolution, which may help depending on your use case: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#configuration

Finally, migrating future-proofs your pipelines, since you also benefit from ongoing enhancements to Auto Loader.
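As a rough illustration of the points above, here is a minimal sketch of how an Auto Loader source might be configured for file notification mode with schema inference. The helper names (`auto_loader_options`, `build_stream`) and the example paths are hypothetical and not from this thread. The `cloudFiles.*` option keys are the documented Auto Loader options, but please check the docs linked above for the set supported by your runtime.

```python
def auto_loader_options(file_format, schema_location, use_notifications=True):
    """Build the option map for an Auto Loader (cloudFiles) source.

    Hypothetical helper for illustration; only the option keys are real.
    """
    return {
        # Format of the incoming files, e.g. "json", "csv", "parquet".
        "cloudFiles.format": file_format,
        # "true" selects file notification mode instead of directory listing.
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
        # Where Auto Loader keeps the inferred schema across restarts.
        "cloudFiles.schemaLocation": schema_location,
    }


def build_stream(spark, source_path, options):
    """Attach the options to a streaming reader.

    Assumes `spark` is a SparkSession on a Databricks runtime,
    where the "cloudFiles" source is available.
    """
    reader = spark.readStream.format("cloudFiles")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load(source_path)


# Example paths are placeholders, not real locations.
opts = auto_loader_options("json", "s3://example-bucket/_schemas/events")
```

On a cluster, `build_stream(spark, "s3://example-bucket/events/", opts)` would return a streaming DataFrame that you write out with `writeStream` and a `checkpointLocation`, as with any Structured Streaming source.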

brickster_2018
Esteemed Contributor

That makes sense, @Anand Ladda!

One major improvement with a direct impact on performance is the architectural difference. S3-SQS uses an internal Delta table to store the checkpoint details about the source files. From what we have seen with customers, a Delta table is not an efficient way to checkpoint source file details; a store with faster retrieval and insertion is needed. Auto Loader therefore uses RocksDB for its checkpointing, which directly improves the performance of streaming queries.
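To make the access pattern concrete, here is a toy sketch of what "checkpointing source file details" amounts to: a lookup per incoming file and an insert per processed file. This is a conceptual model only, not Auto Loader's or RocksDB's actual implementation. The class `ProcessedFileLog` and its methods are hypothetical, and a plain in-memory dict stands in for the key-value store.

```python
class ProcessedFileLog:
    """Toy key-value log of ingested files (illustrative only)."""

    def __init__(self):
        # Maps file path -> id of the batch that ingested it.
        self._seen = {}

    def mark_processed(self, path, batch_id):
        # One point insert per processed file.
        self._seen[path] = batch_id

    def is_new(self, path):
        # One point lookup per discovered file.
        return path not in self._seen

    def filter_new(self, paths):
        # Keep only the files that have not been ingested yet.
        return [p for p in paths if self.is_new(p)]


log = ProcessedFileLog()
log.mark_processed("s3://example-bucket/events/a.json", batch_id=0)
pending = log.filter_new([
    "s3://example-bucket/events/a.json",
    "s3://example-bucket/events/b.json",
])
```

Because every discovered file triggers one of these lookups, the speed of point reads and writes in the underlying store matters a lot for a high-volume source.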

Some of the issues seen with S3-SQS and addressed in Auto Loader are listed below:

  • Latency in starting the streaming query
  • The streaming query pausing for a long time every hour
  • Synchronous fetching and deleting causing issues