12-06-2021 03:12 AM
Hello,
We can use Auto Loader to track which files have been loaded from an S3 bucket. My question about Auto Loader: is there a way to read the Auto Loader database to get the list of files that have been loaded?
I can easily do this with AWS Glue job bookmarks, but I'm not aware of how to do it with Databricks Auto Loader.
12-06-2021 03:25 AM
.load("path")
.withColumn("filePath",input_file_name())
than you can for example insert filePath to your stream sink and than get distinct value from there or use forEatch / forEatchBatch and for example insert it into spark sql table
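For completeness, here is a minimal runnable sketch of that approach; the bucket paths, the JSON file format, and the loaded_files_log table name are assumptions for illustration, not part of the original answer:

from pyspark.sql.functions import input_file_name

# Auto Loader stream that tags each row with its source file path
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")   # assumed file format
    .load("s3://my-bucket/input/")         # assumed source path
    .withColumn("filePath", input_file_name()))

# foreachBatch: record the distinct file paths seen in each micro-batch
def record_files(batch_df, batch_id):
    (batch_df.select("filePath").distinct()
        .write.mode("append")
        .saveAsTable("loaded_files_log"))  # hypothetical tracking table

(df.writeStream
    .option("checkpointLocation", "s3://my-bucket/checkpoints/loader/")  # assumed
    .foreachBatch(record_files)
    .start())

You can then run SELECT DISTINCT filePath FROM loaded_files_log to list the files that have been loaded.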
12-09-2021 07:55 AM
Thank you! This works for me 🙏
08-01-2024 02:03 PM
A more efficient way:
SELECT * FROM cloud_files_state('path/to/checkpoint');
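For reference, cloud_files_state takes the checkpoint location of the Auto Loader stream; the path below is a placeholder, not a value from this thread. The same query can be issued from Python:

# Query Auto Loader's internal file-tracking state via its stream checkpoint
files_df = spark.sql(
    "SELECT * FROM cloud_files_state('s3://my-bucket/checkpoints/loader/')"  # placeholder checkpoint path
)
files_df.show(truncate=False)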
12-09-2021 08:12 AM
@Herry Ramli - Would you be happy to mark Hubert's answer as best so that other members can find the solution more easily?
Thanks!