Autoloader delete action on AWS S3

Dejian
New Contributor II

Hi folks, I have been using Auto Loader to ingest files from an S3 bucket.
I added a trigger on the workflow to schedule the job to run every 10 minutes. However, I've recently been facing an error that makes the job keep failing after a few successful runs.

Error:
"com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null;"
Auto Loader only works again if I change the checkpoint location and do a full reload.


I believe our cloud team has not granted the deleteObject permission to the cluster.
But I would like to understand: what exactly is the delete action required for?

My Autoloader code:

from pyspark.sql.functions import current_timestamp, lit

# Incrementally read new files from the S3 source path with Auto Loader
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", format) \
    .option("cloudFiles.inferSchema", inferSchema) \
    .option("cloudFiles.schemaLocation", schema_location) \
    .load(source_path)

# Add ingestion metadata columns
df_with_metadata = df \
    .withColumn("ingestion_timestamp", current_timestamp()) \
    .withColumn("source_metadata", lit(source_name))

# Write to a Delta table, checkpointing progress and processing all
# available files on each scheduled run
df_with_metadata.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(availableNow=True) \
    .toTable(table_name)
 
#Autoloader
2 REPLIES

saisaran_g
Contributor

There might be a few possibilities.

Can you check these items?

1. Is there any S3 bucket policy configured, such as a lifecycle rule that deletes files after a certain timeframe or limits file validity?

2. Check the Auto Loader configuration once again to validate the cleanup option: cloudFiles.cleanup.

3. Is anyone deleting or updating the location path by any chance?

4. Try adding the configuration `.option("cloudFiles.cleanup", "false")` to your code and running again (see the sketch after this list).
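
For context, here is a minimal sketch of where that suggested option would slot into the readStream builder from the original post. The variable names (format, schema_location, source_path) are the ones the poster already defined, and the option key is quoted from the suggestion above, so verify it against the Auto Loader options supported by your runtime:

# Same Auto Loader reader as in the original post, with the suggested
# cleanup option added; the option key is taken from the reply above.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", format) \
    .option("cloudFiles.schemaLocation", schema_location) \
    .option("cloudFiles.cleanup", "false") \
    .load(source_path)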

 

Hope this helps identify something.

Happy learning and solving new errors!
Saran

Dejian
New Contributor II

Hi Saran,

I contacted Databricks support, and they say the deleteObject action on S3 is mandatory. So I have requested that the cloud team grant that permission on the checkpoint path.
The pipeline is working fine now.
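
For anyone facing the same issue, below is a minimal sketch of the kind of IAM policy statement that grants the delete action only on the checkpoint prefix. The bucket name and prefix are hypothetical placeholders, not values from this thread:

import json

# Hypothetical policy statement: allows read, write, and delete only under the
# Auto Loader checkpoint/schema prefix. Replace the bucket and prefix with your own.
checkpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AutoloaderCheckpointAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-example-bucket/autoloader/checkpoints/*"
        }
    ]
}

print(json.dumps(checkpoint_policy, indent=2))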
An interesting observation is that the checkpoint only stores up to 102 offset objects in S3 and starts removing the older offsets.
But this doesn't seem to affect Auto Loader's ability to skip files that have already been ingested.
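
That retention is likely the standard Structured Streaming checkpoint behaviour rather than anything Auto Loader-specific: old offset and commit files are periodically deleted from the checkpoint, which is also why deleteObject is needed on that path. The retention count is governed by a general Spark config (default 100), which roughly lines up with the 102 objects observed; a quick way to check it:

# General Spark setting, not something stated in this thread: the minimum number
# of streaming batches whose metadata is retained in the checkpoint (default 100).
print(spark.conf.get("spark.sql.streaming.minBatchesToRetain"))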

Thank you.

Regards
Dejian
