Data Engineering

Why should I move to Auto-loader?

brickster_2018
Esteemed Contributor

I have a streaming workload that uses the S3-SQS connector. The streaming job runs fine within its SLA. Should I migrate the job to Auto Loader? If yes, what are the benefits? Who should migrate, and who should not?

2 REPLIES

aladda
Honored Contributor II

The primary benefit of Auto Loader is that it abstracts away the checkpointing of which data has already been processed successfully and which still needs processing. For your use case the file notification mode would work well: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#file-discovery-modes. Depending on your use case, you may also benefit from Auto Loader's schema inference and schema evolution: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#configuration
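To make the file notification and schema options concrete, here is a minimal sketch of the reader configuration in PySpark. The bucket paths and the helper names (autoloader_options, read_with_autoloader) are hypothetical and only for illustration; the option keys are Auto Loader's cloudFiles options, and the stream itself only runs on a Databricks cluster where `spark` is available.

```python
def autoloader_options(file_format, schema_location, use_notifications=True):
    """Build Auto Loader (cloudFiles) reader options as a plain dict.

    Helper name and defaults are illustrative assumptions, not part of any API.
    """
    return {
        # Format of the incoming files (json, csv, parquet, ...).
        "cloudFiles.format": file_format,
        # "true" selects file notification mode instead of directory listing.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
        # Where Auto Loader stores the inferred schema and its versions.
        "cloudFiles.schemaLocation": schema_location,
        # Add newly seen columns to the schema as the data evolves.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def read_with_autoloader(spark, source_path, options):
    """Return a streaming DataFrame; requires a Databricks SparkSession."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Hypothetical paths, for illustration only.
opts = autoloader_options("json", "s3://example-bucket/_schemas/events")
# On Databricks: df = read_with_autoloader(spark, "s3://example-bucket/events/", opts)
```

Keeping the options in a plain dict makes it easy to review or diff the configuration before the stream is started.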

Finally, moving to Auto Loader future-proofs your pipelines, since they pick up enhancements to Auto Loader as those are released.

brickster_2018
Esteemed Contributor

That makes sense, @Anand Ladda!

One major improvement that directly affects performance is the architectural difference. S3-SQS internally uses a Delta table to store checkpoint details about the source files. We have seen from customers that a Delta table is not an efficient way to checkpoint source file details; a store with faster lookups and inserts is needed. Auto Loader therefore checkpoints with RocksDB, which directly improves the performance of streaming queries.

Some of the issues seen with S3-SQS and addressed in Auto Loader:

  • Latency when starting the streaming query
  • The streaming query pausing for a long time every hour
  • Synchronous fetching and deleting causing issues
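For an existing S3-SQS job, most of the change is in the reader options. The sketch below translates a set of S3-SQS options into Auto Loader options, assuming the job uses the S3-SQS connector's fileFormat, queueUrl and region options. The helper name, queue URL and paths are hypothetical. Because the two sources store their file-tracking state differently (a Delta table versus RocksDB), the migrated stream most likely needs a fresh checkpoint location; verify this against the current documentation before cutting over.

```python
def sqs_to_autoloader_options(sqs_options, schema_location):
    """Translate S3-SQS reader options into Auto Loader (cloudFiles) options.

    Hypothetical helper for illustration; it only handles the three
    S3-SQS options named in the lead-in.
    """
    converted = {
        "cloudFiles.format": sqs_options["fileFormat"],
        # File notification mode, the closest match to the SQS-driven source.
        "cloudFiles.useNotifications": "true",
        "cloudFiles.schemaLocation": schema_location,
    }
    # Reuse an existing queue instead of letting Auto Loader create one.
    if "queueUrl" in sqs_options:
        converted["cloudFiles.queueUrl"] = sqs_options["queueUrl"]
    if "region" in sqs_options:
        converted["cloudFiles.region"] = sqs_options["region"]
    return converted


# Hypothetical existing configuration, for illustration only.
old_options = {
    "fileFormat": "json",
    "queueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue",
    "region": "us-east-1",
}
new_options = sqs_to_autoloader_options(old_options, "s3://example-bucket/_schemas/events")
```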
