Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partition in Spark

NanthakumarYoga
New Contributor II

Hi Community, I need your help understanding the topics below.

I have a huge transaction dataset (20 GB), stored as Parquet and partitioned by the transaction_date column. The data is evenly distributed (no skew). There are 10 days of data, giving 10 partition folders of 1 GB each.

Path = '\FileStore\Nantha\Trx\data\2024-01-01\'    ....  '\FileStore\Nantha\Trx\data\2024-01-10'

Now, I would like to understand the following:

1. When reading the files without a WHERE clause (a simple read), how does Spark partition the data and process it in parallel? The default partition size is 128 MB, and I'm confused about what this 128 MB partition size actually means.

2. When spark.sql.shuffle.partitions is 200, does that mean 200 partitions of 128 MB each? How are these partitions referenced and calculated? Are they internal to Spark?

3. When we call cache() or persist() on a DataFrame, does it store the whole DataFrame in MEMORY/DISK, or does it store it partition by partition internally?
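Regarding question 1: the 128 MB comes from spark.sql.files.maxPartitionBytes, which caps how much file data one input partition may read. A minimal sketch of the idea, assuming splittable files like Parquet and ignoring Spark's extra factors (spark.sql.files.openCostInBytes, packing of small splits, default parallelism), so this is only an approximation of the real planner:

```python
import math

def input_partitions(file_sizes_bytes, max_partition_bytes=128 * 1024 * 1024):
    """Rough estimate of Spark input partitions for splittable files:
    each file is chopped into splits of at most max_partition_bytes.
    Real Spark also packs small splits together, so treat this as a sketch."""
    return sum(math.ceil(size / max_partition_bytes) for size in file_sizes_bytes)

# 10 date folders of ~1 GB each, as in the question:
sizes = [1 * 1024**3] * 10
print(input_partitions(sizes))  # 8 splits of 128 MB per 1 GB file -> 80 partitions
```

So with 10 evenly sized 1 GB folders, a plain read would plan roughly 80 input tasks, and Spark runs as many of them concurrently as there are available cores. The shuffle partition count (question 2) is a separate setting: 200 is simply the number of output buckets after a wide operation, with no guarantee each holds 128 MB.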

 

2 REPLIES

payalbhatia
New Contributor II

I have follow-up questions here:
1) The OP mentions 1 GB of data in each folder. So will Spark read ~8 partitions across 8 cores (if available)?
2) What if I get empty partitions after the shuffle?

Personal1
New Contributor II

I read a .zip file in Spark and get unreadable data when I run show() on the DataFrame.

When I check the number of partitions using df.rdd.getNumPartitions(), I get 8 (the number of cores I am using). Shouldn't the partition count be just 1, since I read a non-splittable/compressed file?

When I was using only 1 core, I got only 1 partition, though.
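A likely explanation, hedged as an assumption since we can't see the job plan: Spark has no built-in .zip codec, so the reader does not recognize the file as compressed at all. It treats the raw archive bytes as a plain splittable file (hence the 8 planned splits on 8 cores) and show() prints the compressed bytes as gibberish. The usual fix is to unpack the archive before handing the contents to Spark, for example:

```python
import io
import zipfile

def unzip_to_texts(zip_bytes):
    """Return {member_name: decoded_text} for every file inside a zip archive.
    Decompress first, then read the extracted files with Spark as usual."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}

# Build a tiny in-memory zip just to demonstrate the round trip:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "id,amount\n1,100\n")

print(unzip_to_texts(buf.getvalue()))  # {'data.csv': 'id,amount\n1,100\n'}
```

By contrast, codecs Spark does recognize (like gzip) are read as non-splittable, which is why a single .gz text file typically yields exactly 1 partition.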
