
Confused with Databricks Tips and Tricks - Optimizations regarding partitioning

eimis_pacheco
Contributor

Hello Community,

Today I attended the Tips and Tricks - Optimizations webinar and I became confused when they said:

"Don't partition tables <1TB in size and plan carefully when partitioning

• Partitions should be >=1GB"

 

Now my confusion is whether this recommendation applies to storing the data on disk when writing it at the end of the Spark job, or whether it also applies while a job is running and we are doing transformations and want to split the data across more executors so it runs faster. In other words, if I want to partition a table that is close to 1TB to make the job faster by splitting it across, say, 10 executors, should I not do that?

 

Thank you in advance for the clarification.

 

Thanks

#dataengineering

4 REPLIES

Kaniz_Fatma
Community Manager

Hi @eimis_pacheco

- Don't partition tables <1TB in size, and plan carefully when partitioning.
- Partitioning strategy impacts Spark job performance.
- Partitioning may not benefit smaller datasets (<1TB).
- The overhead of managing multiple partitions may outweigh the benefits of parallel processing.
- The recommendation is not to partition tables smaller than 1TB.
- Partitioning can improve performance for large data sets (close to 1TB or more).
- Each partition should be at least 1GB for even distribution and sufficient data for processing.
- Consider the workload nature and available resources when deciding on a partitioning strategy.
- For an example of partitioning data in Spark using repartition and partitionBy, see the sketch after this list.
- Refer to the Spark SQL Performance Tuning Guide for more tips and tricks on optimizing Spark jobs.
- Efficient partitioning balances the number of partitions and resources available.
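
As a rough sketch of that example (the table and column names here are hypothetical, not from the webinar): repartition() controls how many in-memory partitions Spark uses while the job runs, whereas partitionBy() controls the on-disk folder layout when the data is written out, which is where the "<1TB" guidance applies.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and column names, for illustration only.
df = spark.read.table("events")

# repartition() changes the number of in-memory partitions,
# i.e. how many tasks can work on the data in parallel.
df = df.repartition(200)

# partitionBy() controls the on-disk layout when the data is written;
# this is the kind of partitioning the ">= 1GB per partition" guidance targets.
(df.write
   .mode("overwrite")
   .partitionBy("event_date")
   .format("delta")
   .save("/tmp/events_partitioned"))
```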

Hi @Kaniz_Fatma ,

Sorry for asking again, but does this recommendation of "don't partition tables <1TB" apply only when the data is written to disk, or also while the job is actually running in memory? (Sorry if the answer is obvious.)

Because in clusters we have GBs of memory to allocate to our datasets, not TBs.


Thank you once more.

 

Hi @Kaniz_Fatma, thanks for the tips.
I have a table where roughly 1.7 TB of data is ingested daily.
I have partitioned it on date. In S3 I can see that the files for one day in Parquet format total 1.7 TB.
Should I change how the table is partitioned? Is the partition size too big?

 

-werners-
Esteemed Contributor III

That refers to partitions on disk.


Defining the correct number of partitions is not that easy. One would think that more partitions are better because you can process more data in parallel.
And that is true if you only have to do local transformations (no shuffle needed).
But that is almost never the case.
When a shuffle is applied, having more partitions means more overhead (see the sketch below). That is why the recommendation is not to partition tables below 1TB (that is only a recommendation though; you might have cases where partitioning makes sense with smaller data).
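
To make the shuffle point concrete, a minimal sketch (the "sales" table and its columns are hypothetical): a filter is a narrow transformation that stays within the existing partitions, while a groupBy triggers a shuffle whose output partition count is governed by spark.sql.shuffle.partitions, independently of how the table is partitioned on disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and columns, for illustration only.
df = spark.read.table("sales")

# Narrow transformation: no shuffle, each partition is processed independently.
filtered = df.filter(df.amount > 0)

# Wide transformation: a shuffle moves rows between executors so that
# all rows for a given region end up together.
totals = filtered.groupBy("region").sum("amount")

# The shuffle produces spark.sql.shuffle.partitions output partitions
# (200 by default), regardless of the on-disk partitioning.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```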

More recently there is liquid clustering, which makes things easier (no partitioning necessary), but it only works with Delta Lake and recent Databricks runtime versions.
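
For reference, a hedged sketch of what liquid clustering looks like, assuming a recent Databricks runtime with Delta Lake (the table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table: CLUSTER BY takes the place of PARTITIONED BY,
# so no partition columns are needed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_clustered (
        sale_id   BIGINT,
        region    STRING,
        amount    DOUBLE,
        sale_date DATE
    )
    USING DELTA
    CLUSTER BY (sale_date)
""")

# OPTIMIZE incrementally clusters newly written data.
spark.sql("OPTIMIZE sales_clustered")
```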
