What is the difference between SQS with S3 and Auto Loader in Databricks?
12-26-2023 06:40 PM
I'm curious about the difference between setting an SQS queue URL for S3 in Spark's readStream options and using Auto Loader to read through cloudFiles.
I would also like some advice on which one is better to use in which situation (from a cost and performance perspective).
I'm planning to produce data in real time by applying complex operations such as JOINs to the data brought in by real-time ingestion.
Labels: Spark
12-27-2023 03:45 AM
Hello folks,
The cloudFiles options give you the flexibility to read data from the input source and process the files accordingly. In Databricks you get two options: batch and streaming.
When you use Auto Loader in streaming mode, it automatically detects new files as they arrive in the source and processes them.
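A minimal sketch of the streaming case, assuming JSON input and a Databricks runtime where the `cloudFiles` source is available. The bucket paths, the table name `bronze_events`, and the helper names are hypothetical placeholders, not anything from this thread.

```python
def auto_loader_options(file_format, schema_location):
    """Build the option map for an Auto Loader (cloudFiles) stream."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def read_new_files(spark, source_path, options):
    """Return a streaming DataFrame that picks up files as they land in source_path."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Hypothetical locations; replace with your own bucket or container.
options = auto_loader_options("json", "s3://example-bucket/_schemas/events")

# On Databricks, where `spark` is predefined:
# df = read_new_files(spark, "s3://example-bucket/landing/events/", options)
# (df.writeStream
#    .option("checkpointLocation", "s3://example-bucket/_checkpoints/events")
#    .toTable("bronze_events"))
```

The checkpoint location is what lets the stream remember which files it has already processed across restarts.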
Please share your use case: is your input source an AWS S3 bucket, and what is SQS used for?
In my past experience, my use case was to ingest data from ADLS Gen2 into the Bronze, Silver, and Gold layers.
01-02-2024 04:58 PM
The input source is objects in an S3 bucket that are created by AWS DMS, and I will use SQS as a file arrival notification in a streaming pipeline. In Delta Live Tables, the production pipeline is constantly running and I want it to stop when a file arrives.
12-27-2023 04:56 AM
Using Spark readStream with SQS is very similar to using cloudFiles in file notification mode (which also uses SQS on the backend).
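To make the similarity concrete, here is a rough side-by-side sketch of the two option sets. It assumes the SQS-based reader is the legacy Databricks `s3-sqs` source and that Auto Loader is pointed at an existing queue; the queue URL, region, path, and helper names are hypothetical.

```python
# Hypothetical queue and region; substitute your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
REGION = "us-east-1"


def sqs_source_options(queue_url, region, file_format):
    """Options for the legacy Databricks `s3-sqs` streaming source."""
    return {"queueUrl": queue_url, "region": region, "fileFormat": file_format}


def auto_loader_notification_options(queue_url, region, file_format):
    """Options for Auto Loader in file notification mode, reusing an existing queue."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.useNotifications": "true",
        "cloudFiles.queueUrl": queue_url,
        "cloudFiles.region": region,
    }


sqs_opts = sqs_source_options(QUEUE_URL, REGION, "json")
al_opts = auto_loader_notification_options(QUEUE_URL, REGION, "json")

# On Databricks, with `spark` and an explicit `schema` already defined:
# sqs_df = (spark.readStream.format("s3-sqs")
#           .options(**sqs_opts).schema(schema).load())
# al_df = (spark.readStream.format("cloudFiles")
#          .options(**al_opts).schema(schema)
#          .load("s3://example-bucket/dms-output/"))
```

Both readers consume the same kind of queue; the difference is mostly in which extra options each source exposes.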
cloudFiles comes with some additional options compared to native Spark readStream:
One benefit is that you can set a backfill interval that periodically lists the source for files that were missed during loading (cloud providers do not guarantee that 100% of the files arriving in storage are registered in the queue). If you use your own mechanism to send events to the queue, you might not need it, but keep it in mind if you need 100% data completeness.
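As a sketch of the backfill setting, the helper below adds `cloudFiles.backfillInterval` to an existing option map. The helper name, the `parquet` format, and the `1 day` interval are assumptions chosen for illustration; pick an interval that matches how late a missed file may be tolerated.

```python
def with_backfill(options, interval):
    """Return a copy of Auto Loader options with a periodic backfill enabled."""
    updated = dict(options)  # copy so the caller's dict is not mutated
    updated["cloudFiles.backfillInterval"] = interval
    return updated


# Assumed base options for file notification mode.
notification_options = {
    "cloudFiles.format": "parquet",
    "cloudFiles.useNotifications": "true",
}

options = with_backfill(notification_options, "1 day")

# On Databricks:
# df = (spark.readStream.format("cloudFiles")
#       .options(**options)
#       .schema(schema)
#       .load("s3://example-bucket/dms-output/"))
```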
Auto Loader is often described as Spark readStream on steroids, with a more performant API for retrieving files from cloud storage. I think you would need to test this by setting up both Auto Loader with file notification and Spark streaming with SQS, and checking whether you see a difference in performance. If Auto Loader runs faster, the job finishes sooner and you can shut down the cluster earlier.
If you plan to run this 24/7, there might be no difference in cost.
Complex operations have nothing to do with the loading technique: both approaches produce a streaming DataFrame, and your transformations are executed in the same way.
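One way to picture this is a transformation written against a DataFrame, with no knowledge of how that DataFrame was loaded. The function name, the `customer_id` column, and the two inputs are hypothetical.

```python
def enrich_orders(orders_df, customers_df):
    """Left-join a streaming orders DataFrame to a customers DataFrame on customer_id."""
    return orders_df.join(customers_df, on="customer_id", how="left")


# The same function applies whichever source produced `orders_df`:
# enriched = enrich_orders(sqs_df, customers_df)  # stream read via SQS
# enriched = enrich_orders(al_df, customers_df)   # stream read via Auto Loader
```

Keeping the JOIN logic in a function like this also makes it easier to swap the ingestion method when comparing performance.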