Partitioning in Spark

NanthakumarYoga
New Contributor

Hi Community, I need your help understanding the topics below.

I have a huge transaction dataset (20 GB) stored as Parquet, partitioned by the transaction_date column. The data is evenly distributed (no skew): there are 10 days of data and 10 partition folders, each containing 2 GB.

Path = '/FileStore/Nantha/Trx/data/2024-01-01/'    ....  '/FileStore/Nantha/Trx/data/2024-01-10'

Now, I would like to understand here,

1. While reading the files without a where condition (a simple read), how does Spark partition the data and process it in parallel? I am confused about the default partition size of 128 MB. What does this mean?

2. When spark.sql.shuffle.partitions is set to 200, does that mean 200 partitions of 128 MB each? How is this partition count determined and calculated? Is it purely internal?

3. When we issue cache or persist on a DataFrame, will this store the whole DataFrame in MEMORY/DISK, or will it be stored partition by partition internally?

 

1 REPLY

Kaniz
Community Manager

Hi @NanthakumarYoga, Let’s delve into each of your questions about Spark and data partitioning:

  1. Data Partitioning and Parallel Processing:

    • When you read a large Parquet dataset without any where condition (a simple read), Spark automatically partitions the data for parallel processing.
    • By default, Spark caps each input split at 128 MB; this is the spark.sql.files.maxPartitionBytes setting. The dataset is divided into smaller chunks (partitions) of roughly 128 MB each.
    • These partitions are distributed across the available processing nodes (executors) in your Spark cluster.
    • The goal is for each executor to process a manageable portion of the data concurrently, maximizing parallelism and overall performance.
    • Note that read partitions do not correspond one-to-one with your date folders: at 128 MB per split, each 2 GB folder is read as roughly 16 partitions, so the full 20 GB dataset yields about 160 read partitions, not 10 (see the first code sketch after this list).
  2. Shuffle Partitions and Their Size:

    • When you perform operations that involve data shuffling (such as joins or aggregations), Spark uses shuffle partitions.
    • The spark.sql.shuffle.partitions setting determines the number of shuffle partitions; by default it is 200.
    • This is a partition count, not a size, and it is unrelated to the 128 MB split size used during simple reads.
    • Shuffle partitions control how data is redistributed during shuffle operations.
    • For example, in a join, Spark shuffles rows between partitions so that matching keys end up in the same partition.
    • The shuffle partitions help manage data movement during these operations, ensuring efficient parallelism.
    • The actual size of each shuffle partition depends on the data distribution and the total amount of data being shuffled, so 200 partitions of 128 MB each is not implied (see the second sketch after this list).
  3. Cache and Persist:

    • When you cache or persist a DataFrame, Spark stores the intermediate result in memory or on disk.
    • The storage level determines where the data is stored:
      • MEMORY_ONLY: Stores the data in memory as deserialized Java objects.
      • MEMORY_AND_DISK: Stores the data in memory, and if it doesn’t fit, spills to disk.
      • Other storage levels include serialization options (MEMORY_ONLY_SER, MEMORY_AND_DISK_SER) and disk-only storage (DISK_ONLY).
    • When you cache or persist a DataFrame, it doesn’t store the entire DataFrame as a whole. Instead:
      • Each partition of the DataFrame is cached independently.
      • If a partition doesn’t fit in memory, it spills to disk (based on the storage level).
      • Cached partitions are reused across subsequent actions on the same DataFrame.
    • Caching helps avoid recomputing expensive transformations and improves performance for iterative or interactive workloads (see the third sketch after this list).
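
To make point 1 concrete, here is a minimal PySpark sketch. The base path comes from your question; it assumes the date folders follow Spark's transaction_date=2024-01-01 partition-naming convention so the base directory can be read directly, and the counts in the comments assume the 128 MB default split size and a 20 GB dataset, so treat them as ballpark figures.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# The max bytes Spark packs into one read partition (input split).
# Default is 128 MB -- this is the "128MB" from question 1.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# A simple read of the whole dataset. Parquet is splittable, so Spark
# plans roughly one task per 128 MB of input: ~160 partitions for 20 GB,
# not one partition per date folder.
df = spark.read.parquet("/FileStore/Nantha/Trx/data")

# How many read partitions were actually created.
print(df.rdd.getNumPartitions())
```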
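For point 2, a sketch of how the shuffle partition count shows up in practice, continuing from the sketch above (reuses spark and df); the groupBy on transaction_date is just an illustrative aggregation:

```python
# spark.sql.shuffle.partitions is a partition *count*, not a size: the
# output of every shuffle (join, groupBy, ...) is split this many ways.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# An aggregation that forces a shuffle.
daily_counts = df.groupBy("transaction_date").count()

# Expect 200 post-shuffle partitions -- with only 10 distinct dates, at
# most 10 of them actually hold data. On recent Spark/Databricks, AQE
# (enabled by default) may coalesce these small partitions to fewer.
print(daily_counts.rdd.getNumPartitions())
```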
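And for point 3, a sketch of per-partition caching, again reusing df from the first sketch; note that for DataFrames, cache() defaults to the MEMORY_AND_DISK storage level:

```python
from pyspark import StorageLevel

# cache() is shorthand for persist() at the DataFrame default level,
# MEMORY_AND_DISK: keep partitions in memory, spill the rest to disk.
df.cache()

# To pick a level explicitly, use persist() instead of cache(), e.g.:
# df.persist(StorageLevel.DISK_ONLY)

# Storage is per partition: the first action computes and stores each
# partition independently; later actions reuse the stored partitions.
df.count()

# Release the stored partitions when done.
df.unpersist()
```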

In summary:

  • Spark partitions data for parallel processing during reads.
  • Shuffle partitions are used during data shuffling operations.
  • Caching/persisting stores individual partitions in memory or on disk, not the entire DataFrame.

Feel free to explore these concepts further, and let me know if you have any more questions! 😊

 