03-10-2022 05:45 AM
There are two possible solutions:
- Auto Loader (cloudFiles), preferably with the "File notification" queue to avoid unnecessary scans,
OR
- sending a POST request from Lambda to /api/2.1/jobs/run-now
Additionally, in both solutions it is important to have a private link and access via a role (to skip authentication).
For the first option, S3 additionally has to be mounted in Databricks. With the first option you can also take advantage of Spark parallelism, as multiple virtual machines will read and write at the same time.
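A minimal sketch of the first option, assuming a Databricks notebook or job where a SparkSession is available. The function names, the paths, and the choice of Delta as the output format are hypothetical placeholders; only the `cloudFiles` source and its `cloudFiles.*` options come from Auto Loader itself.

```python
def autoloader_options(file_format: str, schema_location: str) -> dict:
    """Build Auto Loader options with file notification mode enabled."""
    return {
        "cloudFiles.format": file_format,
        # File notification mode: new files are discovered through a queue
        # instead of repeatedly listing the S3 prefix.
        "cloudFiles.useNotifications": "true",
        # Where Auto Loader stores the inferred schema (assumed path).
        "cloudFiles.schemaLocation": schema_location,
    }


def start_ingest(spark, source_path, target_path, checkpoint_path, file_format="json"):
    """Start a streaming ingest from source_path into a Delta location.

    `spark` is the SparkSession provided by the Databricks runtime; all
    paths are placeholders that you would replace with your own.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(file_format, checkpoint_path))
        .load(source_path)
    )
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(target_path)
    )
```

In a notebook this would be called with something like `start_ingest(spark, "/mnt/landing/", "/mnt/bronze/events", "/mnt/checkpoints/events")`, where the mount points are again assumptions.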
For the second option, if there is no delay in the S3 trigger, the AWS Lambda will start sooner, but executing the notebook through the API will then be slower, as it can take dozens of seconds to run the job.
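A rough sketch of the second option as a Lambda handler, using only the Python standard library. The environment variable names (`DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `DATABRICKS_JOB_ID`) and the `input_path` notebook parameter are hypothetical, and the bearer token is an assumption made for illustration; with role-based access the authentication part would look different.

```python
import json
import os
import urllib.parse
import urllib.request


def build_run_now_request(host, token, job_id, notebook_params=None):
    """Build (but do not send) the POST request for /api/2.1/jobs/run-now."""
    payload = {"job_id": job_id}
    if notebook_params:
        payload["notebook_params"] = notebook_params
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumed auth scheme for this sketch.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def lambda_handler(event, context):
    """Triggered by an S3 event; starts the Databricks job for the new file."""
    s3 = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(s3["object"]["key"])
    input_path = f"s3://{s3['bucket']['name']}/{key}"

    request = build_run_now_request(
        host=os.environ["DATABRICKS_HOST"],
        token=os.environ["DATABRICKS_TOKEN"],
        job_id=int(os.environ["DATABRICKS_JOB_ID"]),
        notebook_params={"input_path": input_path},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        # The response body includes the run_id of the triggered run.
        return json.loads(response.read())
```

The notebook behind the job would then read `input_path` as a widget value; this only starts the run and does not wait for it to finish.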
I would go for Auto Loader if there are many files per hour.
If it is 1 file per hour or less, I would go with triggering the job through the REST API.
My blog: https://databrickster.medium.com/