Last file in S3 folder using autoloader
07-26-2024 10:38 AM
We already use Auto Loader with a checkpoint location, but I still want to know whether it is possible to read only the most recently updated file within a folder. I know this somewhat defeats the purpose of the checkpoint location.
Another question: is it possible to tell whether Auto Loader returned an empty DataFrame without running a count on it? For example, some metric reporting how many files or MB were read.
- Labels: Delta Lake, Spark
12-06-2024 01:04 PM - edited 12-06-2024 01:05 PM
Auto Loader's scope is limited to incrementally loading files from storage; there is no built-in functionality to load only the latest file from a group of files. You would likely want to implement that "last updated" logic in a different layer, or in flight via stateful processing or custom foreachBatch logic.
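As a rough sketch of the "different layer" approach: list the folder yourself (with boto3, `dbutils.fs.ls`, or similar), pick the path with the greatest modification timestamp, and do a one-off batch read of just that file. The bucket and file names below are hypothetical, and the listing source is an assumption:

```python
# Sketch: pick the most recently modified file from a listing, then read
# only that file with a plain batch read (outside Auto Loader entirely).
# Paths are hypothetical; the listing could come from boto3 or dbutils.fs.ls.

def latest_file(listing):
    """Return the path with the greatest modification timestamp.

    `listing` is an iterable of (path, modified_epoch_seconds) tuples;
    returns None when the listing is empty.
    """
    if not listing:
        return None
    return max(listing, key=lambda item: item[1])[0]

# Example with a made-up listing:
files = [
    ("s3://my-bucket/raw/part-001.json", 1721990000),
    ("s3://my-bucket/raw/part-002.json", 1721993600),  # newest
    ("s3://my-bucket/raw/part-000.json", 1721986400),
]
path = latest_file(files)

# Then a one-off batch read of just that file, e.g.:
# df = spark.read.json(path)
```

Note this bypasses Auto Loader's exactly-once bookkeeping, so it is best kept to ad-hoc reads rather than the main ingestion path.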
For Auto Loader metrics, see the Auto Loader monitoring documentation. The gist is that after each micro-batch, progress metrics are emitted, including the number of input rows, time spent processing, and the backlog (e.g. `numFilesOutstanding` and `numBytesOutstanding` in the source metrics), so you can detect an empty batch without counting the DataFrame.
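For example, you can inspect `query.lastProgress` (a standard Structured Streaming API) after a micro-batch instead of calling `count()`. The payload below mirrors the shape of a progress report, but the specific values are made up for illustration:

```python
# Sketch: decide whether the last micro-batch was empty from the streaming
# progress payload rather than by counting the DataFrame. The dict below
# imitates what `query.lastProgress` returns; the numbers are illustrative.

def batch_was_empty(progress):
    """True if the last micro-batch read no rows (or no progress exists yet)."""
    return progress is None or progress.get("numInputRows", 0) == 0

# Hypothetical lastProgress payload from an Auto Loader stream:
progress = {
    "numInputRows": 0,
    "sources": [{
        "metrics": {
            "numFilesOutstanding": "0",  # Auto Loader backlog: files not yet processed
            "numBytesOutstanding": "0",  # Auto Loader backlog: bytes not yet processed
        }
    }],
}

empty = batch_was_empty(progress)
```

In a real job you would call `batch_was_empty(query.lastProgress)` after the stream has made progress, or read the same fields inside a `StreamingQueryListener`.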

