Autoloader - understanding missing file after schema update

Larrio · 03-07-2023

Hello,

Concerning Autoloader (based on https://docs.databricks.com/ingestion/auto-loader/schema.html), what I understand so far is that when it detects a schema update, the stream fails and I have to rerun it to make it work. That's OK.

But once I rerun it, it looks for missing files, hence the following exception:

Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3://some-bucket/path/to/data/1999/10/20/***.parquet. [CLOUD_FILE_SOURCE_FILE_NOT_FOUND] A file notification was received for file: s3://some-bucket/path/to/data/1999/10/20/***.parquet but it does not exist anymore. Please ensure that files are not deleted before they are processed. To continue your stream, you can set the Spark SQL configuration spark.sql.files.ignoreMissingFiles to true.

It works well once I set spark.sql.files.ignoreMissingFiles to true.
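For context, here is a minimal sketch of what that workaround looks like, assuming a Python notebook where `spark` is the session Databricks provides. The function name `build_autoloader_stream` and the schema location argument are placeholders I made up for illustration, not my exact job code; only the configuration key comes from the error message above.

```python
def build_autoloader_stream(spark, source_path, schema_location):
    """Return an Auto Loader streaming reader that tolerates deleted files.

    `spark` is the active SparkSession (predefined in Databricks notebooks).
    `source_path` and `schema_location` are placeholders for illustration.
    """
    # Workaround suggested by the exception message: skip files that were
    # discovered or notified but no longer exist when the stream reads them.
    spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

    return (
        spark.readStream.format("cloudFiles")
        # The files in the exception above are Parquet.
        .option("cloudFiles.format", "parquet")
        # Where Auto Loader keeps the inferred schema between runs
        # (hypothetical location supplied by the caller).
        .option("cloudFiles.schemaLocation", schema_location)
        .load(source_path)
    )
```

It would be called as `build_autoloader_stream(spark, "s3://some-bucket/path/to/data/", "<schema-location>")`. As far as I understand, this setting only makes the stream skip files it can no longer find; it does not bring back whatever those files contained.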

I understand that it fails the first time it detects a change, but why does it look for deleted files the second time Autoloader runs?

What is the impact? Do I lose data?

Thanks!
