Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What's the best way to run a Databricks notebook from AWS Lambda?

AmanSehgal
Honored Contributor III

I have a Lambda trigger that fires when a new file arrives in S3. I want this file to be processed straight away using a notebook that upserts all the data into a Delta table.

I'm looking for a solution with minimum latency.

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

There are two possible solutions:

1. Use Auto Loader on the S3 location, OR
2. Trigger a notebook job run from Lambda through the Databricks Jobs REST API.

Additionally, in both solutions it is important to have a private link and access via an IAM role (to skip authentication).

In the first option, S3 additionally has to be mounted in Databricks. The first option also lets you take advantage of Spark parallelism, as multiple virtual machines will read and write at the same time.

In the second option, if there is no delay in the S3 trigger, the AWS Lambda will fire quickly, but executing the notebook through the API will be slower, as it can take dozens of seconds to start the job.

I would go for Auto Loader if there are many files per hour.

If it is 1 file per hour or less, I would go with a job trigger through the REST API.
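
For the second option, a minimal Lambda handler could look like the sketch below. It assumes the notebook is already wrapped in a Databricks job, and that the workspace URL, a personal access token, and the job ID are provided as environment variables; all of those names are placeholders, not details from this thread.

import json
import os
import urllib.request

# Placeholder configuration - supplied as Lambda environment variables.
HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
JOB_ID = int(os.environ["DATABRICKS_JOB_ID"])

def lambda_handler(event, context):
    # Forward the bucket and key of the new S3 object to the notebook.
    s3 = event["Records"][0]["s3"]
    payload = {
        "job_id": JOB_ID,
        "notebook_params": {
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
        },
    }
    req = urllib.request.Request(
        f"{HOST}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # run-now returns a run_id; the job itself runs asynchronously.
        return json.loads(resp.read())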

4 REPLIES

RKNutalapati
Valued Contributor

Hi @Aman Sehgal: I am not sure if you have already explored Databricks Auto Loader and its limitations for your use case. Otherwise, you can try using Auto Loader and avoid multiple processes.
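
For reference, a minimal Auto Loader sketch for the upsert use case could look like the following, run inside a Databricks notebook (where spark is predefined). The S3 paths, file format, table name, and merge key are all placeholders for illustration:

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Merge each micro-batch into the target Delta table on a key column.
    target = DeltaTable.forName(spark, "my_delta_table")  # placeholder table
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")  # placeholder key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .format("cloudFiles")  # Auto Loader source
    .option("cloudFiles.format", "json")  # placeholder file format
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/ingest")
    .load("s3://my-bucket/incoming/")  # placeholder input path
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/ingest")
    .start())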

AmanSehgal
Honored Contributor III

I've used Auto Loader, and it works like a charm.

But I'm not sure what the interval for the trigger processing time should be.

I was thinking of using trigger once, so that when a file arrives in S3, the Lambda triggers the notebook to process it.

The thing is that files could arrive five in a minute, or maybe once every 3 hours. The frequency is not set.

But whenever a file arrives, it should be processed with minimum latency.
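
For what it's worth, the interval question comes down to a single trigger setting on the write stream; both variants below are sketches with placeholder paths, assuming df is an Auto Loader stream like the one sketched above. A continuously running stream with a short processing interval gives the lowest latency but keeps the cluster up, while trigger once fits the Lambda-starts-a-job pattern (pick one):

# Option A: keep the stream running; new files are picked up at each
# micro-batch interval (lowest latency, but the cluster stays up).
(df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_chk")  # placeholder
    .trigger(processingTime="30 seconds")
    .toTable("my_delta_table"))

# Option B: process whatever has arrived, then stop - suited to a job
# that Lambda kicks off on file arrival.
(df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_chk")  # placeholder
    .trigger(once=True)
    .toTable("my_delta_table"))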

-werners-
Esteemed Contributor III

Basically you are looking for an event-based trigger, where the event is the arrival of a new file.

It's been a while since I worked on AWS, but doesn't Glue have functionality like this?

On Azure I do exactly the same thing: when a file arrives in a certain location, a data pipeline starts that contains a Databricks notebook.

If you want minimal latency, I suggest you use a pool with active workers, but that comes with a price, of course.
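
On AWS, the Databricks-side equivalent is an instance pool: if the job's cluster spec points at a pool that keeps idle instances warm, a new run skips most of the provisioning delay. A rough sketch of the relevant fragment of a Jobs API job definition, with placeholder name, IDs, path, and runtime version:

# Fragment of a Databricks Jobs API job definition (Python dict form).
job_settings = {
    "name": "process-new-s3-file",  # placeholder
    "new_cluster": {
        "spark_version": "11.3.x-scala2.12",  # placeholder runtime
        "instance_pool_id": "pool-0123456789abcdef",  # pre-warmed instances
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Shared/upsert_notebook"},  # placeholder
}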
