I am trying to come up with a low-code/low-cost data ingestion pattern for a publicly available dataset on S3 (https://www.ncbi.nlm.nih.gov/pmc/tools/pmcaws/), where there are hundreds of thousands of files in a folder and more are added daily.
There is a 'file list' CSV file (possibly an S3 inventory?) that would probably be a better way to identify all of the documents than listing them via boto3, which works but returns at most 1,000 keys per call and could get costly:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client for the public bucket
s3_client = boto3.client(
    's3',
    region_name='us-east-1',
    config=Config(signature_version=UNSIGNED)
)
# Returns at most 1,000 keys per call; see the paginator sketch below
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=folder_prefix)
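For completeness, the 1,000-result cap is per call; a paginator will walk the whole prefix, though at this scale that still means hundreds of LIST requests, which is part of why the file list looks attractive. A minimal sketch reusing the client and placeholder names above:

# Paginator transparently follows continuation tokens past the 1,000-key limit
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=folder_prefix):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['LastModified'])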
I was hoping to create an external table over the file-list CSV (roughly 600 MB) rather than downloading it daily, but that may be optimistic.
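It may not be as optimistic as it sounds: one low-cost way to approximate an external table without standing up Athena/Glue is to query the CSV in place, e.g. with DuckDB's httpfs extension. A minimal sketch; the bucket/key in the URL is a placeholder to swap for the real location of the file list, and DuckDB still scans the ~600 MB file on each query, it just never stores it locally:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Placeholder URL -- substitute the actual bucket/key of the file-list CSV
url = "https://<public-bucket>.s3.amazonaws.com/<prefix>/oa_file_list.csv"

# read_csv_auto exposes the remote file like a table; no local copy is kept
rows = con.execute(
    f"SELECT * FROM read_csv_auto('{url}') LIMIT 10"
).fetchall()
print(rows)

If the file list carries a last-updated column, a daily job could filter on it to pick up only the new keys.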
Any other ideas are appreciated...
Thanks.