03-09-2022 04:11 PM
I have a Lambda function that is triggered when a new file arrives in S3. I want that file to be processed straight away by a notebook that upserts all of its data into a Delta table.
I'm looking for a solution with minimum latency.
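The upsert itself can be done with a Delta Lake MERGE in the notebook. A minimal sketch, assuming a JSON source file and a hypothetical target table `target_table` keyed on `id` (names and path are placeholders, not from the thread):

```python
# Minimal notebook-side upsert sketch: merge one incoming file into a Delta table.
# `spark` is the ambient SparkSession in a Databricks notebook.
from delta.tables import DeltaTable

incoming = spark.read.json("s3://my-bucket/incoming/new_file.json")  # hypothetical path

target = DeltaTable.forName(spark, "target_table")  # hypothetical table name
(target.alias("t")
 .merge(incoming.alias("s"), "t.id = s.id")  # hypothetical join key
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```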
Accepted Solutions
03-10-2022 05:45 AM
There are two possible solutions (sketches of both follow below):
- Auto Loader (cloudFiles), ideally with a "file notification" queue to avoid unnecessary directory scans,
OR
- having the Lambda send a POST request to /api/2.1/jobs/run-now.
In both solutions it is worth having a private link and access via an IAM role (to avoid managing credentials).
For the first option, the S3 bucket additionally has to be mounted in Databricks. The first option also lets you take advantage of Spark parallelism, as multiple virtual machines read and write at the same time.
For the second option, if there is no delay in the S3 trigger, the AWS Lambda will fire quickly, but executing the notebook through the API will be slower, as starting the job can take dozens of seconds.
I would go for Auto Loader if there are many files per hour.
If it is one file per hour or less, I would go with a job triggered through the REST API.
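A sketch of the first option: Auto Loader in file-notification mode, upserting each micro-batch into the Delta table via foreachBatch. Bucket paths, table name, and the `id` key are assumptions, not from the thread:

```python
# Auto Loader (cloudFiles) stream with file notifications; each micro-batch
# is merged into the target Delta table.
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # MERGE the micro-batch into the target table (hypothetical name and key).
    target = DeltaTable.forName(spark, "target_table")
    (target.alias("t")
     .merge(batch_df.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(spark.readStream
 .format("cloudFiles")
 .option("cloudFiles.format", "json")
 .option("cloudFiles.useNotifications", "true")  # file-notification mode, no directory scans
 .load("s3://my-bucket/incoming/")               # hypothetical source path
 .writeStream
 .foreachBatch(upsert_batch)
 .option("checkpointLocation", "s3://my-bucket/_checkpoints/incoming")
 .start())
```

And a sketch of the second option: a Lambda handler that calls POST /api/2.1/jobs/run-now for a pre-created job. The environment variables are placeholders; in practice the token should come from AWS Secrets Manager or similar rather than a plain variable:

```python
# AWS Lambda handler that triggers a Databricks job run on file arrival.
import json
import os
import urllib.request

def lambda_handler(event, context):
    workspace = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
    token = os.environ["DATABRICKS_TOKEN"]      # placeholder; use a secrets store in practice
    payload = json.dumps({"job_id": int(os.environ["JOB_ID"])}).encode()

    req = urllib.request.Request(
        f"{workspace}/api/2.1/jobs/run-now",
        data=payload,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())          # response includes the run_id of the triggered run
```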
03-09-2022 08:25 PM
Hi @Aman Sehgal: I'm not sure if you have already explored Databricks Auto Loader and its limitations for your use case. If not, you could try Auto Loader and avoid running multiple processes.
03-09-2022 10:18 PM
I've used Auto Loader, and it works like a charm.
But I'm not sure what the interval for the trigger processing time should be.
I was thinking of it as trigger-once only: when a file arrives in S3, the Lambda triggers the notebook to process the file.
The thing is that files could arrive at five per minute, or maybe once every three hours; the frequency is not fixed.
But whenever a file arrives, it should be processed with minimum latency.
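For reference, the two trigger modes being weighed here look like this on the same Auto Loader stream (paths and table name are placeholders). A short processingTime interval keeps latency low but needs an always-on cluster; trigger-once suits a Lambda-invoked job that processes whatever has arrived and then stops:

```python
# The stream definition is shared; only the trigger differs.
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .load("s3://my-bucket/incoming/"))   # hypothetical path

# Always-on variant: poll for new files every 30 seconds.
(stream.writeStream
 .trigger(processingTime="30 seconds")
 .option("checkpointLocation", "s3://my-bucket/_checkpoints/continuous")
 .toTable("target_table"))

# Run-and-stop variant (what a Lambda-triggered job would use):
# process everything available, then shut down.
# (stream.writeStream
#  .trigger(once=True)
#  .option("checkpointLocation", "s3://my-bucket/_checkpoints/once")
#  .toTable("target_table"))
```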
03-09-2022 10:58 PM
Basically you are looking for an event-based trigger, where the event is the arrival of a new file.
It's been a while since I worked on AWS, but doesn't Glue have functionality like this?
On Azure I do exactly the same thing: when a file arrives in a certain location, a data pipeline starts that contains a Databricks notebook.
If you want minimal latency, I suggest you use a pool with active workers, but that comes at a price, of course.
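A hypothetical sketch of the "pool with active workers" idea: an instance pool created with idle instances kept warm (via POST /api/2.0/instance-pools/create), and a job cluster spec that draws from it, so triggered runs skip VM provisioning. Names, node type, and sizes are placeholders; the idle VMs incur cloud cost while they wait:

```python
# Instance pool spec: keep two VMs warm so job clusters start quickly.
pool_spec = {
    "instance_pool_name": "low-latency-pool",
    "node_type_id": "i3.xlarge",
    "min_idle_instances": 2,
}

# Job cluster spec referencing the pool (passed as `new_cluster` in the Jobs API).
job_new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "instance_pool_id": "<pool-id-returned-by-create>",
    "num_workers": 2,
}
```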

