jerickson
New Contributor II

I am trying to come up with a low-code/low-cost data ingestion pattern for a publicly available dataset on S3 (https://www.ncbi.nlm.nih.gov/pmc/tools/pmcaws/), where there are hundreds of thousands of files in a folder and more are added daily.

There is a 'file list' CSV (possibly an S3 inventory?) that would probably be a better way to identify all of the documents than listing them via boto3, which works but returns at most 1,000 keys per response and could get costly:

 

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3_client = boto3.client(
    's3',
    region_name='us-east-1',  # the PMC bucket is in us-east-1
    config=Config(signature_version=UNSIGNED)  # anonymous access to the public bucket
)
# Returns at most 1,000 keys per call; continuation tokens are needed beyond that
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=folder_prefix)
 
I was hoping to create an external table over the file-list CSV (which is 600 MB) rather than downloading it daily, but I suspect that might be optimistic.
 
Any other ideas are appreciated...
Thanks.