What is the best way to handle huge gzipped files dropped into S3?
===================================================================
I have found some interesting suggestions in previously posted questions. Thanks for reviewing my threads. Here is the situation we have.
We are getting a data feed from on-prem to S3, and the feed can only push data in gzip format. When the dropped files in the S3 bucket/folder are huge (e.g., north of 20 GB), we face loading challenges in Databricks. Our current Databricks Auto Loader job takes a very long time, with the risk of a retry if the load fails.
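Roughly, our Auto Loader stream looks like the sketch below; the bucket paths, file format, and table name are placeholders rather than our exact job:

```python
# Minimal Auto Loader sketch; paths, format, and table names are placeholders.
# Spark decompresses .gz files transparently for CSV/JSON/text sources, but a
# single 20 GB gzip file is still read and decompressed by one task.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-feed-bucket/_schemas/feed")
    .option("header", "true")
    .load("s3://my-feed-bucket/landing/")
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-feed-bucket/_checkpoints/feed")
    .trigger(availableNow=True)
    .toTable("bronze.feed_raw")
)
```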
We know other databases have bulk-load options with parallel split/distribution to handle this situation.
Do we have similar or better options in Databricks?
Can we use a Databricks external table to tie to the S3 gzip file with a partition option? Would this help?
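Something like the sketch below is what I have in mind; the table name, columns, and location are made up, and I have left the partition clause out since I am not sure it applies to a single gzip file:

```python
# Hypothetical external table over the gzip drop folder; names and columns
# are placeholders. Spark infers the gzip codec from the .gz file extension.
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.feed_gz_ext (
        record_id STRING,
        payload   STRING
    )
    USING CSV
    OPTIONS (header 'true')
    LOCATION 's3://my-feed-bucket/landing/'
""")
```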
Do we need to go to the upstream on-prem data push process and ask them to drop files into S3 in smaller sizes (e.g., 4 GB)? 😂
I have seen the standard way for this described as:
"Read the Gzip File from S3: Use boto3 to read the gzip file from S3 and load it into your Databricks environment."
How have folks in this community addressed these issues?
Thanks for your guidance.