07-24-2024 05:15 AM
Auto Loader uses Spark Structured Streaming, but you can run it in a "batch" mode. As I mentioned in one of my earlier responses, you can run it as batch jobs with Trigger.AvailableNow. Here is the link to the documentation once again:
Configure Auto Loader for production workloads | Databricks on AWS
How it works:
- you set up a diagnostic setting to write logs into a storage directory (for the sake of example, let's call it "input_data")
- you configure Auto Loader and point it to that path ("input_data")
- you configure a job to run once an hour
- your job starts and Auto Loader loads all files that are in "input_data" into the target table. When the job ends, the job cluster is terminated
- in the meantime, more logs are written to the storage (to the "input_data" directory)
- an hour passes, so your job starts once again. This time Auto Loader loads only the new files that arrived since the last run, because it tracks already-processed files in its checkpoint
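
The steps above could be sketched roughly like this in PySpark. This is only a minimal sketch under a few assumptions: the logs are JSON files, the checkpoint path and target table name are placeholders I made up, and the helper names `auto_loader_options` and `run_incremental_load` are hypothetical. The `cloudFiles` source is only available on Databricks, and `spark` is the session your notebook or job already provides.

```python
def auto_loader_options(schema_location, file_format="json"):
    """Build Auto Loader reader options.

    Assumption: the diagnostic logs are JSON; change file_format if not.
    """
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def run_incremental_load(spark, input_path, checkpoint_path, target_table):
    """Load every file not yet processed from input_path, then stop."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(checkpoint_path))
        .load(input_path)
    )
    query = (
        stream.writeStream
        # The checkpoint records which files were already ingested,
        # so the next run picks up only new arrivals.
        .option("checkpointLocation", checkpoint_path)
        # Process everything available right now, then terminate.
        .trigger(availableNow=True)
        .toTable(target_table)
    )
    # Block until the backlog is processed so the job can finish
    # and the job cluster can be terminated.
    query.awaitTermination()


# Example call inside the hourly job (paths and table name are placeholders):
# run_incremental_load(spark, "input_data", "/tmp/checkpoints/logs", "logs_bronze")
```

Keep the checkpoint path the same between runs. If it changes, Auto Loader has no record of what it already loaded and will start from the beginning.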