jerickson
New Contributor II

I am trying to come up with a low-code/low-cost data ingestion pattern for a publicly available dataset on S3 (https://www.ncbi.nlm.nih.gov/pmc/tools/pmcaws/), where there are hundreds of thousands of files in a folder and more are added daily.

There is a 'file list' CSV (possibly an S3 inventory?) that would probably be a better way to identify all of the documents than listing them via boto3, which works but returns at most 1,000 keys per response and could get costly:

 

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3_client = boto3.client(
    's3',
    region_name='us-east-1',  # the PMC bucket is in us-east-1
    config=Config(signature_version=UNSIGNED)  # anonymous access to the public bucket
)
# Returns at most 1,000 keys per call; continuation tokens are needed beyond that
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=folder_prefix)
 
I was hoping to create an external table over the file-list CSV (which is 600 MB) rather than downloading it daily, but I suspect that might be optimistic.
 
Any other ideas are appreciated...
Thanks.