Hi folks, I have been using Autoloader to ingest files from an S3 bucket.
I added a trigger to the workflow to schedule the job to run every 10 minutes. However, recently I have been facing an error that causes the job to keep failing after a few successful runs.
Error:
"com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null;"
Autoloader only works again if I change the checkpoint location and do a full reload.
I believe our cloud team has not granted the s3:DeleteObject permission to the cluster.
But I would like to understand: what exactly is the delete action required for?
My Autoloader code:
from pyspark.sql.functions import current_timestamp, lit

# Incrementally read new files from S3 with Autoloader
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", format) \
    .option("cloudFiles.inferSchema", inferSchema) \
    .option("cloudFiles.schemaLocation", schema_location) \
    .load(source_path)

# Add ingestion metadata columns
df_with_metadata = df \
    .withColumn("ingestion_timestamp", current_timestamp()) \
    .withColumn("source_metadata", lit(source_name))

# Write to a Delta table, processing all files available at trigger time
df_with_metadata.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(availableNow=True) \
    .toTable(table_name)
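For reference, my understanding is that this exception comes from S3's multi-object delete call (the DeleteObjects API), which returns HTTP 200 even when individual keys fail to delete. A quick standalone sketch like the one below (bucket and key are placeholders, not my real paths) could confirm whether the cluster's role is allowed to delete objects at all:

import boto3

# Placeholder bucket/key just for a permission check (not my real paths)
s3 = boto3.client("s3")
resp = s3.delete_objects(
    Bucket="my-test-bucket",
    Delete={"Objects": [{"Key": "tmp/delete-permission-check.txt"}]},
)

# DeleteObjects responds with HTTP 200 even if some keys could not be deleted;
# the per-key failures appear in the "Errors" list, which matches the
# "Status Code: 200" shown in the exception above.
print("Deleted:", resp.get("Deleted", []))
print("Errors:", resp.get("Errors", []))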
#Autoloader