Hubert-Dudek
Databricks MVP

There are two possible solutions:

OR

Additionally, in both solutions it is important to have a private link and to grant access via a role (to skip authentication).

In the first (Auto Loader), S3 additionally has to be mounted in Databricks. There you can also take advantage of Spark parallelism, as multiple virtual machines will read and write at the same time.
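A minimal sketch of the Auto Loader route, assuming the bucket is already mounted. The paths, the file format and the helper names (`autoloader_options`, `start_ingest`) are hypothetical, and `spark` is the session Databricks provides in a notebook:

```python
def autoloader_options(file_format: str, schema_location: str) -> dict:
    """Reader options for Auto Loader (the `cloudFiles` source)."""
    return {
        "cloudFiles.format": file_format,             # format of the incoming files
        "cloudFiles.schemaLocation": schema_location, # where the inferred schema is tracked
    }


def start_ingest(spark, source_path, target_path, checkpoint_path, file_format="json"):
    """Start a stream that picks up new files from the mounted S3 path.

    All paths are placeholders, e.g. source_path="/mnt/raw/events".
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(file_format, checkpoint_path))
        .load(source_path)
    )
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)  # remembers which files were processed
        .start(target_path)
    )
```

In a notebook this would be called as `start_ingest(spark, "/mnt/raw/events", "/mnt/bronze/events", "/mnt/bronze/_checkpoints/events")`; the executors then read the new files in parallel.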

In the second, if there is no delay in the S3 trigger, the AWS Lambda will start sooner, but executing the notebook through the API will then be slower, as it can take dozens of seconds to start the job.
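A rough sketch of the Lambda side, assuming the function is subscribed to S3 object-created events and calls the Jobs API `run-now` endpoint. The environment variable names, the `input_path` notebook parameter and the helper `build_run_now_request` are hypothetical, and a bearer token is used here only for illustration:

```python
import json
import os
import urllib.parse
import urllib.request


def build_run_now_request(host, token, job_id, bucket, key):
    """Build the POST request that asks Databricks to run an existing job."""
    payload = {
        "job_id": job_id,
        # Hypothetical notebook parameter carrying the new file's location.
        "notebook_params": {"input_path": f"s3://{bucket}/{key}"},
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def lambda_handler(event, context):
    """Trigger one job run per object reported in the S3 event."""
    host = os.environ["DATABRICKS_HOST"]        # assumed variable names
    token = os.environ["DATABRICKS_TOKEN"]
    job_id = int(os.environ["DATABRICKS_JOB_ID"])

    run_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        request = build_run_now_request(host, token, job_id, bucket, key)
        with urllib.request.urlopen(request, timeout=30) as response:
            run_ids.append(json.loads(response.read())["run_id"])
    return {"run_ids": run_ids}
```

The call returns as soon as the run is queued, so the Lambda itself finishes quickly; the wait mentioned above happens on the Databricks side while the job starts.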

I would go for Auto Loader if there are many files per hour.

If it is 1 file per hour or less, I would go with a job trigger through the REST API.


My blog: https://databrickster.medium.com/
