Ideal number and size of partitions

Anonymous
Not applicable

Spark uses 200 shuffle partitions by default for transformations. If a user is working with small data, 200 partitions can be too many, and the per-partition overhead slows the query down. Conversely, if the data is big, 200 partitions can be too few, leaving each partition oversized. So how do I figure out the ideal number and size of partitions?

The ideal partition size is expected to be between 128 MB and 1 GB. How do I go about making sure that partition sizes fall within this range?

1 ACCEPTED SOLUTION


sajith_appukutt
Honored Contributor II

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is sample Python code for calculating the value.
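A minimal sketch of that calculation, assuming a target of roughly 128 MB per partition (the low end of the recommended range), the spark and dbutils handles available in a Databricks notebook, and a hypothetical input path used as a proxy for the shuffled data volume:

```python
# Rough sizing heuristic: pick a shuffle-partition count so that each
# partition lands near a target size. 128 MB (the low end of the
# recommended 128 MB-1 GB range) is used as the target here.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # 128 MB

# Estimate the volume of data that will be shuffled. As a simple proxy,
# sum the sizes of the input files (the path is a hypothetical placeholder).
input_bytes = sum(f.size for f in dbutils.fs.ls("/mnt/data/my_table/"))

# Use at least one partition; otherwise one partition per ~128 MB of input.
num_partitions = max(1, input_bytes // TARGET_PARTITION_BYTES)
spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```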

However, if you have multiple workloads with different data volumes, instead of manually specifying the configuration for each of them, it is worth looking at AQE and auto-optimized shuffle.

AQE adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks across stages, the task size remains roughly the same, neither too big nor too small. However, AQE does not change the initial partition number by default, so if you are seeing spill in your jobs, you could enable auto-optimized shuffle by setting <db_prefix>.autoOptimizeShuffle.enabled to true.
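For reference, a minimal sketch of the relevant settings in a notebook. The spark.sql.adaptive.* keys are standard Spark 3.x configuration; the auto-optimized-shuffle key assumes the <db_prefix> placeholder above expands to spark.databricks.adaptive, which may differ across Databricks Runtime versions:

```python
# Enable AQE so shuffle partition counts are adjusted per stage,
# and let it coalesce small shuffle partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Auto-optimized shuffle (Databricks-specific). Assumes <db_prefix>
# expands to spark.databricks.adaptive; check your runtime's docs.
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
```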

More details at

https://databricks.com/blog/2020/10/21/faster-sql-adaptive-query-execution-in-databricks.html

