Autoloader delete action on AWS S3

Dejian
New Contributor II

Hi folks, I have been using Auto Loader to ingest files from an S3 bucket.
I added a trigger on the workflow to schedule the job to run every 10 minutes. However, I've recently been facing an error that makes the job keep failing after a few successful runs.

Error:
"com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null;"
Auto Loader only works again if I change the checkpoint location and do a full reload.


I believe our cloud team has not granted the deleteObject permission to the cluster.
But I would like to understand: what exactly is the delete action required for?

My Autoloader code:

from pyspark.sql.functions import current_timestamp, lit

# Incrementally read new files from the S3 source path with Auto Loader
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", format) \
    .option("cloudFiles.inferSchema", inferSchema) \
    .option("cloudFiles.schemaLocation", schema_location) \
    .load(source_path)

# Add ingestion metadata columns
df_with_metadata = df \
    .withColumn("ingestion_timestamp", current_timestamp()) \
    .withColumn("source_metadata", lit(source_name))

# Write to a Delta table, checkpointing progress and processing all
# available files on each scheduled run
df_with_metadata.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(availableNow=True) \
    .toTable(table_name)
 
#Autoloader
2 REPLIES

saisaran_g
Contributor

There might be a few possibilities.

Can you check these items?

1. Is there any S3 bucket policy configured, such as a lifecycle rule that deletes files after a certain timeframe or limits file validity?

2. Check the Auto Loader configuration once again to validate the cleanup option: cloudFiles.cleanup.

3. Is anyone deleting or updating the location path by any chance?

4. Try adding the configuration `.option("cloudFiles.cleanup", "false")` to your code and running again (see the sketch after this list).
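
For context, here is a minimal sketch of where that suggested option would slot into the readStream builder from the original post. The variable names (format, schema_location, source_path) are the ones the poster already defined, and the option key is quoted from the suggestion above, so verify it against the Auto Loader options supported by your runtime:

# Same Auto Loader reader as in the original post, with the suggested
# cleanup option added; the option key is taken from the reply above.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", format) \
    .option("cloudFiles.schemaLocation", schema_location) \
    .option("cloudFiles.cleanup", "false") \
    .load(source_path)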

 

Hope this helps identify something.

Happy learning and solving new errors!
Saran

Dejian
New Contributor II

Hi Saran,

I contacted Databricks support, and they say the deleteObject action on S3 is mandatory. So I have requested that the cloud team grant that permission on the checkpoint path.
The pipeline is working fine now.
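
For anyone facing the same issue, below is a minimal sketch of the kind of IAM policy statement that grants the delete action only on the checkpoint prefix. The bucket name and prefix are hypothetical placeholders, not values from this thread:

import json

# Hypothetical policy statement: allows read, write, and delete only under the
# Auto Loader checkpoint/schema prefix. Replace the bucket and prefix with your own.
checkpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AutoloaderCheckpointAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-example-bucket/autoloader/checkpoints/*"
        }
    ]
}

print(json.dumps(checkpoint_policy, indent=2))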
An interesting observation is that the checkpoint only stores up to 102 offset objects in S3 and starts removing the older offsets.
But this doesn't seem to affect Auto Loader's ability to skip files that have already been ingested.
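
That retention is likely the standard Structured Streaming checkpoint behaviour rather than anything Auto Loader-specific: old offset and commit files are periodically deleted from the checkpoint, which is also why deleteObject is needed on that path. The retention count is governed by a general Spark config (default 100), which roughly lines up with the 102 objects observed; a quick way to check it:

# General Spark setting, not something stated in this thread: the minimum number
# of streaming batches whose metadata is retained in the checkpoint (default 100).
print(spark.conf.get("spark.sql.streaming.minBatchesToRetain"))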

Thank you.

Regards
Dejian
