Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the best way to handle a huge gzipped file dropped to S3?

RIDBX
New Contributor II


I have found some interesting suggestions on posted questions. Thanks for reviewing my threads. Here is the situation we have.

We are getting a data feed from on-prem to S3, and the feed can only push data in gzip format. When the dropped files in the S3 bucket/folder are huge (e.g., north of 20 GB), we face loading challenges in Databricks. Our current Databricks Auto Loader job takes a very long time, with the risk of a retry if the load fails.
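
For reference, here is roughly the shape of our current Auto Loader job (the table names, paths, and CSV format below are simplified placeholders, not our exact setup):

```python
# Sketch of the current Auto Loader ingest (all names/paths are hypothetical).
# Assumes a Databricks notebook where `spark` is already defined.
# Note: a single .csv.gz file is read by one task, since gzip is not splittable.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/feed")  # hypothetical
    .option("header", "true")
    .load("s3://my-bucket/landing/feed/")  # hypothetical landing prefix
)

(
    stream.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/feed")  # hypothetical
    .trigger(availableNow=True)
    .toTable("bronze.feed_raw")  # hypothetical target table
)
```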

We know other databases have bulk-load options with parallel split/distribution to handle this situation.

Do we have such options, or better ones, in Databricks?

Can we use a Databricks external table to tie to the S3 gzip files, with a partition option? Would this help?
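
Something like the following sketch is what I have in mind (all table names, columns, and paths are made up; as I understand it, Spark reads .gz transparently, but each gzip file is still consumed by a single task):

```python
# Hypothetical external table over the gzipped CSV files in S3.
# Partitioning only helps with pruning if the S3 layout already has
# partition directories such as .../ingest_date=2024-01-01/.
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.feed_external (
        id BIGINT,
        event_ts TIMESTAMP,
        payload STRING,
        ingest_date DATE
    )
    USING CSV
    OPTIONS ('header' = 'true')
    PARTITIONED BY (ingest_date)
    LOCATION 's3://my-bucket/landing/feed/'
""")

# Register any partition directories already present under the location.
spark.sql("MSCK REPAIR TABLE bronze.feed_external")
```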

Or do we need to go to the upstream on-prem data push process and ask for the files to be dropped into S3 in smaller sizes (e.g., 4 GB)? 😂

I see the standard way for this described as:

"Read the Gzip File from S3: Use boto3 to read the gzip file from S3 and load it into your Databricks environment."

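In other words, something along these lines (the bucket and key are placeholders; this runs single-threaded on the driver, so it does not parallelize a 20 GB file):

```python
# Minimal sketch of the quoted boto3 approach (hypothetical bucket/key).
import gzip

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="landing/feed/big_file.csv.gz")

# Stream-decompress without pulling the whole object into memory at once.
with gzip.GzipFile(fileobj=obj["Body"]) as gz:
    record_count = sum(1 for _ in gz)  # e.g. just count records as a smoke test

print(record_count)
```
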
How have folks in this community addressed these issues?

Thanks for your guidance.

 

 

 

1 REPLY

RIDBX
New Contributor II

@Retired_mod 

Thanks for weighing in.

I have learned that RDDs are the predecessor to DataFrames. What is the reason RDDs perform better than DataFrames?
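
For concreteness, the two APIs I am comparing would be roughly the following (the path is hypothetical, and this assumes a notebook where `spark` is already defined):

```python
# RDD API: low-level lines of text, no Catalyst/Tungsten optimizations.
rdd = spark.sparkContext.textFile("s3://my-bucket/landing/feed/big_file.csv.gz")

# DataFrame API: same file, parsed as CSV and planned by the optimizer.
df = (
    spark.read
    .option("header", "true")
    .csv("s3://my-bucket/landing/feed/big_file.csv.gz")
)
```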

Are RDDs being used for new implementations?

Thanks for patiently addressing my questions. Are you able to tell us in which situation/condition each option is applicable?

Thanks for your guidance.

 
