<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: What's the best way to run a Databricks notebook from AWS Lambda? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26049#M18174</link>
    <description>&lt;P&gt;Basically you are looking for an event-based trigger, where the event is the arrival of a new file.&lt;/P&gt;&lt;P&gt;It's been a while since I worked on AWS, but doesn't Glue have functionality like this?&lt;/P&gt;&lt;P&gt;On Azure I do exactly the same thing: when a file arrives in a certain location, a data pipeline containing a Databricks notebook starts.&lt;/P&gt;&lt;P&gt;If you want minimal latency, I suggest you use a pool with active workers, but that comes at a price, of course.&lt;/P&gt;</description>
    <pubDate>Thu, 10 Mar 2022 06:58:15 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-03-10T06:58:15Z</dc:date>
    <item>
      <title>What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26046#M18171</link>
      <description>&lt;P&gt;I have a trigger in Lambda that fires when a new file arrives in S3. I want this file to be processed straight away by a notebook that upserts all the data into a Delta table.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm looking for a solution with minimum latency.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 00:11:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26046#M18171</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-03-10T00:11:12Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26047#M18172</link>
      <description>&lt;P&gt;Hi @Aman Sehgal: I am not sure if you have already explored Databricks Autoloader and its limitations for your use case. If not, you could try using Autoloader and avoid running multiple processes.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 04:25:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26047#M18172</guid>
      <dc:creator>RKNutalapati</dc:creator>
      <dc:date>2022-03-10T04:25:12Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26048#M18173</link>
      <description>&lt;P&gt;I've used Autoloader, and it works like a charm.&lt;/P&gt;&lt;P&gt;But I'm not sure what the trigger processing-time interval should be.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I was thinking of a trigger-once setup: when a file arrives in S3, the Lambda triggers the notebook to process the file.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The thing is that files could arrive at around five per minute, or maybe once every 3 hours; the frequency is not fixed.&lt;/P&gt;&lt;P&gt;But whenever a file arrives, it should be processed with minimum latency.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 06:18:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26048#M18173</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-03-10T06:18:48Z</dc:date>
    </item>
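    <!-- A minimal sketch of the trigger-once Autoloader pattern discussed above. The bucket, schema/checkpoint locations, and table name are made-up placeholders, not values from the thread. -->
```python
# Hypothetical sketch of the Autoloader setup discussed above: a trigger-once
# stream that processes whatever files have arrived, then stops.
# All paths and names below are illustrative placeholders.

CLOUDFILES_OPTIONS = {
    "cloudFiles.format": "json",                                    # format of incoming files
    "cloudFiles.schemaLocation": "s3://my-bucket/_schemas/events",  # where schema info is tracked
    "cloudFiles.useNotifications": "true",                          # file-notification mode, no full scans
}

def start_autoloader(spark):
    """Start a trigger-once Autoloader stream; `spark` is an active SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**CLOUDFILES_OPTIONS)
        .load("s3://my-bucket/landing/events/")
        .writeStream
        .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
        .trigger(once=True)  # process all available files, then shut down
        .table("events_bronze")
    )
```
    <!-- With trigger-once, each Lambda-initiated run only pays for the files that actually arrived, which matches the irregular 5-per-minute to once-every-3-hours pattern described above. -->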
    <item>
      <title>Re: What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26049#M18174</link>
      <description>&lt;P&gt;Basically you are looking for an event-based trigger, where the event is the arrival of a new file.&lt;/P&gt;&lt;P&gt;It's been a while since I worked on AWS, but doesn't Glue have functionality like this?&lt;/P&gt;&lt;P&gt;On Azure I do exactly the same thing: when a file arrives in a certain location, a data pipeline containing a Databricks notebook starts.&lt;/P&gt;&lt;P&gt;If you want minimal latency, I suggest you use a pool with active workers, but that comes at a price, of course.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 06:58:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26049#M18174</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-10T06:58:15Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26050#M18175</link>
      <description>&lt;P&gt;There are two possible solutions:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" target="_blank"&gt;Autoloader/cloudFiles&lt;/A&gt;, preferably with a "&lt;B&gt;File notification&lt;/B&gt;" queue to avoid unnecessary directory scans,&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;OR&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;from Lambda, sending a POST request to &lt;A href="https://docs.databricks.com/dev-tools/api/latest/jobs.html" alt="https://docs.databricks.com/dev-tools/api/latest/jobs.html" target="_blank"&gt;/api/2.1/jobs/run-now&lt;/A&gt;.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In both solutions it is important to have a private link and access via a role (to skip manual authentication).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In the first, S3 additionally has to be mounted in Databricks. The first can also take advantage of Spark parallelism, as multiple virtual machines read and write at the same time.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In the second, if there is no delay in the S3 trigger, the AWS Lambda will run quickly, but executing the notebook through the API will be slower, as it can take dozens of seconds to start the job.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I would go for Autoloader if there are many files per hour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If it is one file per hour or less, I would go with a job triggered through the REST API.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 13:45:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26050#M18175</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-10T13:45:01Z</dc:date>
    </item>
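    <!-- A minimal sketch of the second option above: a Lambda handler that POSTs to /api/2.1/jobs/run-now when a file lands in S3. The environment variable names and the `source_path` notebook parameter are assumptions for illustration, not part of the Jobs API itself. -->
```python
# Hypothetical Lambda handler for the "POST to /api/2.1/jobs/run-now" option.
# DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_JOB_ID are assumed env vars;
# "source_path" is a made-up notebook parameter the target notebook would read.
import json
import os
import urllib.request

def build_run_now_payload(job_id, s3_path):
    """Build the run-now request body, passing the new file's path to the notebook."""
    return {
        "job_id": job_id,
        "notebook_params": {"source_path": s3_path},  # read via dbutils.widgets in the notebook
    }

def lambda_handler(event, context):
    # S3 put-event records carry the bucket name and object key of the new file.
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    payload = build_run_now_payload(int(os.environ["DATABRICKS_JOB_ID"]), s3_path)
    req = urllib.request.Request(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # response body includes the run_id of the triggered run
```
    <!-- As noted above, the run-now call returns quickly but the job itself can take tens of seconds to start unless the job uses a warm instance pool. -->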
  </channel>
</rss>

