What is the difference between SQS with S3 and Auto Loader in Databricks?

rt-slowth
Contributor

I'm curious about the difference between setting an S3 SQS queue URL as a Spark readStream option and using Auto Loader to read via cloudFiles.


I would also like some advice on which one is better to use in which situations, from a cost and performance perspective.


I'm planning to produce data in real time by applying complex operations, such as JOINs, to the data coming in through real-time ingestion.
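For reference, here is a minimal sketch of the two approaches I'm comparing. The bucket, queue URL, and schema are placeholders, the s3-sqs source is the legacy Databricks S3-SQS connector (its option names may differ by runtime), and spark is the session Databricks provides in notebooks.

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Placeholder schema for the incoming files.
schema = StructType([
    StructField("id", StringType()),
    StructField("updated_at", TimestampType()),
])

# Approach 1: legacy Databricks S3-SQS connector, pointed at the SQS queue
# that receives the S3 event notifications.
sqs_stream = (
    spark.readStream.format("s3-sqs")
    .schema(schema)
    .option("fileFormat", "parquet")
    .option("queueUrl", "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue")
    .option("region", "us-east-1")
    .load()
)

# Approach 2: Auto Loader (cloudFiles) reading the same bucket; with file
# notification mode enabled it also consumes S3 events via SQS under the hood.
autoloader_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/dms")
    .load("s3://my-bucket/dms-output/")
)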

3 REPLIES

BR_DatabricksAI
Contributor

Hello folks,

The cloudFiles options give you the flexibility to read data from an input source and process the files accordingly. In Databricks you get two options: batch and streaming.

When you stream with Auto Loader, it automatically detects that new files have arrived at the source and processes them automatically.
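As a rough sketch of the two modes (paths and table names are placeholders, and spark is the session Databricks provides):

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
    .load("s3://my-bucket/events/")
)

# Batch-style ingestion: process whatever has arrived so far, then stop.
(df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)          # swap for .trigger(processingTime="1 minute")
    .toTable("bronze.events"))           # to keep the query running continuously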

Please share your use case: is your input source an AWS S3 bucket, and what is the role of SQS?

In my past experience, my use case was to ingest data from ADLS Gen2 into Bronze, Silver & Gold layers.

 

The input source is S3 objects created by AWS DMS, and I will use SQS for file-arrival notifications in a streaming pipeline. In Delta Live Tables the production pipeline is constantly running, and I want it to stop when a file arrives.
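What I have in mind looks roughly like this (assuming the DMS task writes Parquet; the path is a placeholder and I would still need to verify the file notification options):

import dlt

@dlt.table(comment="Raw DMS output ingested with Auto Loader file notifications")
def bronze_dms_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")         # assuming DMS writes Parquet
        .option("cloudFiles.useNotifications", "true")  # file notification mode (SQS on AWS)
        .load("s3://my-bucket/dms-output/")             # placeholder path
    )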

Wojciech_BUK
Contributor III

Using Spark readStream with SQS will be very similar to using CloudFiles with FileNotification mode (which also uses SQS on the backend).
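If you already have your own notification plumbing, Auto Loader should also be able to consume an existing queue instead of creating one. A minimal sketch (placeholder URL and paths; please verify the option names against the Auto Loader docs for your runtime):

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useNotifications", "true")
    # Reuse an existing SQS queue that already receives the S3 events.
    .option("cloudFiles.queueUrl", "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/dms")
    .load("s3://my-bucket/dms-output/")
)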

CloudFiles comes with some additional options compared to native Spark readStream:

Common Auto Loader Options

One benefit is that you can set a backfill interval that will regularly scan for files that could have been skipped during loading (cloud providers do not guarantee that 100% of files arriving in storage are registered in the queue). If you are using your own mechanism to send notifications to the queue, you might not need it, but be aware of it if you need 100% data completeness.
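For example, the backfill is just one extra option on the same read (the interval value below is only an example, and the paths are placeholders):

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.backfillInterval", "1 week")   # periodically list the directory
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/dms")
    .load("s3://my-bucket/dms-output/")                # to catch files the queue missed
)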

There is often talk that Auto Loader is like Spark readStream on steroids, with improved performance when retrieving files from cloud storage. I think you would need to test it by setting up both Auto Loader with file notification and Spark streaming with SQS, and check whether you see a difference in performance. If Auto Loader runs faster, the job will finish sooner and you can shut down the cluster earlier.

If you plan to run this 24/7, then there might be no difference in cost.

From the perspective of complex operations, it has nothing to do with the loading technique: both will give you a streaming DataFrame, and your transformations will execute in the same way.
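For example, a stream-static join looks the same no matter which source produced the streaming DataFrame (table names, paths, and the join key below are placeholders):

# stream_df can come from either Auto Loader or the S3-SQS source.
stream_df = (
    spark.readStream.format("cloudFiles")              # or format("s3-sqs") with a queueUrl
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/dms")
    .load("s3://my-bucket/dms-output/")
)

dim_df = spark.table("silver.customers")               # static lookup table

joined = stream_df.join(dim_df, on="customer_id", how="left")   # stream-static join

(joined.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/joined")
    .toTable("silver.enriched_events"))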