Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partition in Spark

NanthakumarYoga
New Contributor II

Hi Community, I need your help understanding the topics below.

I have a huge transaction dataset (20 GB) stored as Parquet, partitioned by the transaction_date column. The data is evenly distributed (no skew). There are 10 days of data, so we have 10 partition folders, each containing 1 GB.

Path = '\FileStore\Nantha\Trx\data\2024-01-01\'    ....  '\FileStore\Nantha\Trx\data\2024-01-10'

Now, I would like to understand the following:

1. While reading the files without a where condition (a simple read), how does Spark partition the data and process it in parallel? (The default partition size is 128 MB.) I am confused about this 128 MB partition size. What is it?

2. When we set shuffle.partitions to 200, does that mean 200 partitions of 128 MB each? How are these partitions determined and calculated? Are they internal ones?

3. When we issue cache or persist on a DataFrame, will it store the whole DataFrame in MEMORY / DISK, or will it store it as partitions internally?

 

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @NanthakumarYoga, Let’s delve into each of your questions about Spark and data partitioning:

  1. Data Partitioning and Parallel Processing:

    • When you read a large Parquet file without any specific where condition (a simple read), Spark automatically partitions the data for parallel processing.
    • By default, Spark caps each input partition at about 128 MB (the spark.sql.files.maxPartitionBytes setting). This means the dataset is divided into smaller chunks (partitions), each at most roughly 128 MB in size.
    • These partitions are distributed across the available processing nodes (executors) in your Spark cluster.
    • The goal is to ensure that each executor processes a manageable portion of the data concurrently, maximizing parallelism and overall performance.
    • With your layout (10 date folders of roughly 1 GB each), Spark will therefore not create one partition per day: each folder is split into roughly 8 input partitions of ~128 MB, around 80 partitions in total (see the first sketch after this list).
  2. Shuffle Partitions and Their Size:

    • When you perform operations that involve data shuffling (such as joins or aggregations), Spark uses shuffle partitions.
    • The spark.sql.shuffle.partitions configuration determines the number of shuffle partitions; by default, it is set to 200 (see the second sketch after this list).
    • These shuffle partitions are not directly related to the 128MB partition size used during simple reads.
    • Instead, shuffle partitions control how data is redistributed during shuffling operations.
    • For example, if you perform a join, Spark will shuffle data between partitions to align matching keys.
    • The shuffle partitions help manage the data movement during these operations, ensuring efficient parallelism.
    • The actual size of each shuffle partition depends on the data distribution and the total amount of data being shuffled.
  3. Cache and Persist:

    • When you cache or persist a DataFrame, Spark stores the intermediate result in memory or on disk.
    • The storage level determines where the data is stored:
      • MEMORY_ONLY: Stores the data in memory as deserialized Java objects.
      • MEMORY_AND_DISK: Stores the data in memory, and if it doesn’t fit, spills to disk.
      • Other storage levels include serialization options (MEMORY_ONLY_SER, MEMORY_AND_DISK_SER) and disk-only storage (DISK_ONLY).
    • When you cache or persist a DataFrame, it doesn’t store the entire DataFrame as a whole. Instead:
      • Each partition of the DataFrame is cached independently.
      • If a partition doesn’t fit in memory, it spills to disk (based on the storage level).
      • Cached partitions are reused across subsequent actions on the same DataFrame.
    • Caching helps avoid recomputing expensive transformations and improves performance for iterative or interactive workloads (see the third sketch after this list).
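
To illustrate the first point, here is a minimal PySpark sketch (first sketch), assuming the spark session available in a Databricks notebook and a path adapted from the question's example layout; since the date folders are plain directory names rather than transaction_date=... folders, a glob is used:

    # Input partition size for file sources is governed by this setting (default is 128 MB).
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    # Read every date folder (path and glob adapted from the question's example layout).
    df = spark.read.parquet("/FileStore/Nantha/Trx/data/*")

    # Each ~1 GB folder is split into roughly 8 input splits of ~128 MB,
    # so the 10 folders yield on the order of 80 input partitions.
    print(df.rdd.getNumPartitions())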
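
For the second point, a short sketch of how spark.sql.shuffle.partitions behaves on a wide transformation (second sketch; the transaction_date and amount column names are assumed for illustration):

    # Number of partitions produced by shuffles (joins, groupBy, ...); the default is 200.
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    # A groupBy triggers a shuffle: rows are redistributed by key across the shuffle partitions.
    agg = df.groupBy("transaction_date").sum("amount")

    # The result has 200 partitions regardless of the ~80 input partitions; with
    # Adaptive Query Execution enabled, Spark may coalesce small ones at runtime.
    print(agg.rdd.getNumPartitions())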
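
And for the third point, a sketch of persisting the same DataFrame with an explicit storage level (third sketch); the Storage tab of the Spark UI then lists the cached partitions and shows whether any spilled to disk:

    from pyspark.storagelevel import StorageLevel

    # DataFrame.cache() defaults to a memory-and-disk level; persist() lets you choose explicitly.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # An action materializes the cache, partition by partition.
    df.count()

    # Release the cached partitions once they are no longer needed.
    df.unpersist()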

In summary:

  • Spark partitions data for parallel processing during reads.
  • Shuffle partitions are used during data shuffling operations.
  • Caching/persisting stores individual partitions in memory or on disk, not the entire DataFrame.

Feel free to explore these concepts further, and let me know if you have any more questions! 😊

 

payalbhatia
New Contributor II

I have follow-up questions here:
1) The OP mentions about 1 GB of data in each folder. So Spark will read it as ~8 partitions on 8 cores (if they are available)?
2) What if I get empty partitions after the shuffle?
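
A quick sketch to check both points (the path and the transaction_date column are taken from the example in this thread; actual counts depend on your cluster and on whether Adaptive Query Execution is enabled):

    # ~1 GB folder / 128 MB maxPartitionBytes -> roughly 8 input partitions,
    # which can run concurrently on up to 8 cores.
    one_day = spark.read.parquet("/FileStore/Nantha/Trx/data/2024-01-01/")
    print(one_day.rdd.getNumPartitions())

    # With 200 shuffle partitions and only a handful of distinct keys, most output
    # partitions hold no rows; they add scheduling overhead but move no data.
    # Adaptive Query Execution (spark.sql.adaptive.enabled) coalesces them at runtime.
    rows_per_partition = one_day.groupBy("transaction_date").count().rdd.glom().map(len).collect()
    print(sum(1 for n in rows_per_partition if n == 0), "empty partitions")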
