As the title states, I would like to hear how others have set up an AWS S3 bucket to source data with Auto Loader while also supporting the ability to archive files to Glacier after a certain period of time.
We currently have about 20 million JSON files (average file size 400 KB) in an S3 bucket, all written to the root of the bucket, with roughly 280k new files added per day. I need to move the older files that have already been loaded by Auto Loader into a Glacier archive directory.
We've tried moving them to a new bucket, but AWS DataSync runs out of memory, and our limit-increase support ticket has been open for almost 2 weeks. This feels like too much of a hassle for something that should be fairly straightforward.
Adding an archive directory under the root would cause issues in this case: Auto Loader would still try to read those files, and AWS would charge us for accessing the Glacier objects.
The other option I can think of is to have new files written to one directory and the archived files destined for Glacier written to a separate directory that is not a subdirectory of the other. It would look like this:
../new_data/
../archive/
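For the archive/ prefix, what I had in mind for the Glacier transition is an S3 lifecycle rule rather than touching the objects ourselves. A rough boto3 sketch of that idea; the bucket name, prefix, and 30-day window are placeholders, not settled decisions:

# Sketch: transition everything under the "archive/" prefix to Glacier
# after 30 days. Bucket name, prefix, and the 30-day window are assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-source-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)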
Auto Loader would then be pointed only at "/new_data/" and would only see those files. Checkpoints would need to be reset, but I think that would accomplish our goal. Also, is this approach too simple, or should the directory carry more context, for example "/new_data/use_case_name/", to support other data that may be added later for other use cases?
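To make that concrete, this is roughly how I picture the Auto Loader side after the reset, assuming the "/new_data/use_case_name/" layout; the bucket name, schema/checkpoint locations, and target table are placeholders:

# Minimal sketch of Auto Loader reading only the new_data prefix.
# Bucket, schema/checkpoint paths, and table name are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-source-bucket/_schemas/use_case_name/")
    .load("s3://my-source-bucket/new_data/use_case_name/")
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-source-bucket/_checkpoints/use_case_name/")
    .trigger(availableNow=True)
    .toTable("bronze.use_case_name")
)

Since the stream only ever lists "new_data/", it should never touch anything under "archive/", which is the whole point of keeping the two prefixes side by side rather than nested.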
Thanks in advance!