03-09-2022 04:11 PM
I have a Lambda trigger that fires when a new file arrives in S3. I want this file to be processed straight away by a notebook that upserts all the data into a Delta table.
I'm looking for a solution with minimum latency.
03-09-2022 08:25 PM
Hi @Aman Sehgal: I am not sure if you have already explored Databricks Autoloader and its limitations for your use case. If not, you could try using Autoloader and avoid multiple processes.
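For reference, a minimal Autoloader sketch (PySpark on Databricks) that merges incoming files into a Delta table. The bucket paths, schema/checkpoint locations, file format, table name, and join key are all placeholders, not anything from this thread:

# Minimal Autoloader upsert sketch; all paths and names are placeholders.
from delta.tables import DeltaTable

df = (spark.readStream
      .format("cloudFiles")                       # Autoloader source
      .option("cloudFiles.format", "json")        # assumed file format
      .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
      .load("s3://my-bucket/landing/events/"))

def upsert_to_delta(micro_batch_df, batch_id):
    # Merge each micro-batch into the target Delta table on a key column.
    target = DeltaTable.forName(spark, "events_delta")
    (target.alias("t")
           .merge(micro_batch_df.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(df.writeStream
   .foreachBatch(upsert_to_delta)
   .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
   .start())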
03-09-2022 10:18 PM
I've used Autoloader, and it works like a charm.
But I'm not sure what the trigger processing interval should be.
I was thinking of using trigger-once only, so that when a file arrives in S3, the Lambda triggers the notebook to process it.
The thing is that files could arrive at five per minute or once every three hours; the frequency isn't fixed.
But whenever a file arrives, it should be processed with minimum latency (see the trigger sketch below).
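To make the two trigger modes concrete, a hedged sketch reusing the df and upsert_to_delta names from the Autoloader example above; the checkpoint path is again a placeholder:

# Option A: keep the stream running and process on a short interval.
# Lowest latency, but the cluster stays up the whole time.
(df.writeStream
   .foreachBatch(upsert_to_delta)
   .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
   .trigger(processingTime="1 minute")
   .start())

# Option B: trigger-once -- drain whatever has arrived, then stop.
# Something external (e.g. the Lambda) must start a run per event.
(df.writeStream
   .foreachBatch(upsert_to_delta)
   .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
   .trigger(once=True)
   .start())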
03-09-2022 10:58 PM
Basically you are looking for an event-based trigger, where the event is the arrival of a new file.
It's been a while since I worked on AWS, but doesn't Glue have functionality like this?
On Azure I do exactly the same thing: when a file arrives in a certain location, a data pipeline starts that contains a dbrx notebook.
If you want minimal latency, I suggest you use a pool with active workers (sketched below), but that comes with a price, of course.
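If it helps, this is roughly what drawing a job cluster from a pre-warmed instance pool looks like in a Databricks Jobs cluster spec; the pool id and runtime version are placeholders, not a working config:

# Hedged sketch: a job's new_cluster spec pointing at an instance pool,
# so runs grab pre-warmed VMs instead of provisioning new ones.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",           # placeholder runtime
    "num_workers": 2,
    "instance_pool_id": "pool-0123456789abcdef",   # placeholder pool id
}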
03-10-2022 05:45 AM
There are two possible solutions:
- use Autoloader, i.e. a stream that continuously picks up new files from S3 and merges them into the Delta table,
OR
- keep the AWS Lambda trigger and have it start the notebook as a job run through the Databricks Jobs REST API.
Additionally, in both solutions it is important to have a private link and access via an IAM role (to skip authentication).
In the first solution, the S3 bucket additionally has to be mounted in Databricks. The first solution also lets you take advantage of Spark parallelism, as multiple virtual machines read and write at the same time.
In the second, if there is no delay in the S3 trigger, the AWS Lambda will fire quickly, but executing the notebook through the API will be slower, since starting the job run can take dozens of seconds.
I would go for Autoloader if there are many files per hour.
If it is one file per hour or less, I would go with a job triggered through the REST API; a hedged sketch of that Lambda follows below.
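A sketch of the Lambda side: an S3 event starts a run of an existing Databricks job through the Jobs REST API (run-now on API 2.1). The host, token, and job id come from environment variables here purely as placeholders; in practice the token would live in something like Secrets Manager:

import json
import os
import urllib.request
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    # Placeholder configuration; adapt to your workspace and job.
    host = os.environ["DATABRICKS_HOST"]    # e.g. https://xxx.cloud.databricks.com
    token = os.environ["DATABRICKS_TOKEN"]
    job_id = int(os.environ["DATABRICKS_JOB_ID"])

    # S3 put event: pass the bucket and (URL-decoded) key to the notebook.
    s3 = event["Records"][0]["s3"]
    payload = {
        "job_id": job_id,
        "notebook_params": {
            "bucket": s3["bucket"]["name"],
            "key": unquote_plus(s3["object"]["key"]),
        },
    }

    req = urllib.request.Request(
        url=f"{host}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # run-now returns the run_id of the started job run.
        return json.loads(resp.read())

Note that each run-now call still pays the job start-up cost mentioned above, which is why a pool of warm workers (or Autoloader) is the better fit when files arrive frequently.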