What is the best way to handle huge gzipped files dropped into S3?
===================================================================
I have found some interesting suggestions in previously posted questions. Thanks for reviewing my threads. Here is the situation we have.
We are getting a data feed from on-prem to S3, and the feed can only push data in gzip format. When the dropped files in the S3 bucket/folder are huge (e.g., north of 20 GB), we face loading challenges in Databricks. Our current Databricks Auto Loader job takes a very long time, with the risk of a retry if the load fails.
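Roughly, our Auto Loader stream looks like the sketch below; the bucket paths, file format, and table name are placeholders rather than our exact job:

```python
# Minimal Auto Loader sketch; paths, format, and table names are placeholders.
# Spark decompresses .gz files transparently for CSV/JSON/text sources, but a
# single 20 GB gzip file is still read and decompressed by one task.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-feed-bucket/_schemas/feed")
    .option("header", "true")
    .load("s3://my-feed-bucket/landing/")
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-feed-bucket/_checkpoints/feed")
    .trigger(availableNow=True)
    .toTable("bronze.feed_raw")
)
```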
We know other databases have bulk-load options with parallel split/distribution to handle this situation.
Do we have similar or better options in Databricks?
Can we use a Databricks external table to tie to the S3 gzip file with a partition option? Would this help?
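Something like the sketch below is what I have in mind; the table name, columns, and location are made up, and I have left the partition clause out since I am not sure it applies to a single gzip file:

```python
# Hypothetical external table over the gzip drop folder; names and columns
# are placeholders. Spark infers the gzip codec from the .gz file extension.
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.feed_gz_ext (
        record_id STRING,
        payload   STRING
    )
    USING CSV
    OPTIONS (header 'true')
    LOCATION 's3://my-feed-bucket/landing/'
""")
```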
Do we need to go to the upstream on-prem data push process and ask them to drop files into S3 in smaller sizes (e.g., 4 GB)? 😂
I have seen the standard way for this described as:
"Read the Gzip File from S3: Use boto3 to read the gzip file from S3 and load it into your Databricks environment."
How have folks in this community addressed these issues?
Thanks for your guidance.