Handling files used more than once in a streaming pipeline
01-09-2024 03:57 PM
I am implementing Structured Streaming using Delta Live Tables. I want to delete the parquet files once they have been used. What options should I set so that the files loaded into S3 are not deleted?
- Labels:
  - Delta Lake
  - Spark
  - Workflows
01-15-2024 06:23 PM - edited 01-15-2024 06:24 PM
Hi,
It sounds like your Structured Streaming source is S3, in which case the easiest solution is likely to manage the source files with an S3 Lifecycle Configuration (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html). The objects could be expired on a time-based trigger (e.g. after 14 days), or managed more granularly with tags: after a file is processed, call the AWS SDK to tag the object with a key/value pair such as `delete=true`, and a lifecycle rule then expires objects carrying that tag. A rough sketch of the tagging approach is below.
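As a minimal sketch only (assuming boto3 is available, the cluster has `s3:PutObjectTagging` / `s3:PutLifecycleConfiguration` permissions, and using placeholder bucket/key names), the tag-then-expire pattern could look something like this:

```python
import boto3

s3 = boto3.client("s3")

def mark_for_deletion(bucket: str, key: str) -> None:
    """Tag a processed object so an S3 Lifecycle rule can expire it."""
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "delete", "Value": "true"}]},
    )

# One-time setup: a lifecycle rule that expires objects carrying the tag.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-landing-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-processed-files",
                "Filter": {"Tag": {"Key": "delete", "Value": "true"}},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)

# After your stream has finished with a file, tag it for cleanup.
mark_for_deletion("my-landing-bucket", "incoming/part-0001.parquet")
```

With this approach the pipeline itself never deletes anything; it only tags, and S3 Lifecycle does the actual cleanup on its own schedule.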
Thanks.

