Partition in Spark

NanthakumarYoga
New Contributor II

Hi Community, I need your help understanding the topics below.

I have a huge transaction dataset (20 GB), stored as Parquet and partitioned by the transaction_date column. The data is evenly distributed (no skew). There are 10 days of data, so we have 10 partition folders, each containing 1 GB.

Path = '\FileStore\Nantha\Trx\data\2024-01-01\'    ....  '\FileStore\Nantha\Trx\data\2024-01-10'

Here is what I would like to understand:

1. When reading the files without a where condition (a simple read), how does Spark partition the data and process it in parallel? The default partition size is 128 MB, and this is where I am confused: what does the 128 MB refer to?

2. When spark.sql.shuffle.partitions is set to 200, does that mean there are 200 partitions of 128 MB each? How are these partitions referenced and calculated? Are they internal ones?

3. When we issue cache or persist on a DataFrame, does Spark store the whole DataFrame in memory / on disk as a single unit, or does it store it internally partition by partition?
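
For questions 1 and 2, a rough back-of-the-envelope sketch may help. It is plain Python, not Spark code, and only mirrors how Spark's file-based sources are generally understood to size input splits. The 128 MB is the default of spark.sql.files.maxPartitionBytes, an upper bound on the bytes packed into one input partition, whereas spark.sql.shuffle.partitions (default 200) is a count of partitions after a shuffle, not a size. The one 1 GiB file per folder, the 8 cores, and the 5 GiB shuffle volume are assumptions made for illustration, and real Parquet splits also depend on row-group boundaries, so treat the numbers as estimates.

```python
MIB = 1024 * 1024

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MIB,  # spark.sql.files.maxPartitionBytes default
                    open_cost_bytes=4 * MIB):       # spark.sql.files.openCostInBytes default
    """Approximate the split size Spark picks for a file scan."""
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

def estimate_input_partitions(file_sizes, default_parallelism):
    """Rough count: each file is cut into chunks of at most the split size."""
    split = max_split_bytes(file_sizes, default_parallelism)
    # Integer ceiling division per file; ignores packing of small leftovers.
    return sum(-(-size // split) for size in file_sizes)

# Assumption: 10 date folders, each holding a single 1 GiB Parquet file.
file_sizes = [1024 * MIB] * 10
cores = 8  # assumed cluster parallelism

split = max_split_bytes(file_sizes, cores)
input_partitions = estimate_input_partitions(file_sizes, cores)
print(split // MIB, "MiB per split,", input_partitions, "input partitions")

# Shuffle partitions are a fixed count; their size follows from the data volume.
shuffle_partitions = 200       # spark.sql.shuffle.partitions default
shuffled_bytes = 5 * 1024 * MIB  # hypothetical amount of data being shuffled
avg_shuffle_partition_mib = shuffled_bytes / shuffle_partitions / MIB
print("average shuffle partition:", avg_shuffle_partition_mib, "MiB")
```

Under these assumptions the scan comes out at about 80 input partitions (8 per folder), which the 8 cores would work through in waves of 8 tasks, and each of the 200 shuffle partitions would hold roughly 25.6 MiB rather than 128 MB.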

 

payalbhatia
New Contributor II

I have follow-up questions here:
1) The OP mentions 1 GB of data in each folder. So will Spark read each folder as ~8 partitions on 8 cores (if that many are available)?
2) What if I get empty partitions after the shuffle?
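
To illustrate where empty shuffle partitions can come from, here is a small plain-Python simulation, not Spark code. It assumes the shuffle key is transaction_date with the OP's 10 distinct dates and 200 shuffle partitions, and it uses zlib.crc32 as a stand-in for Spark's internal hash function, so the exact partition numbers will differ from a real job. The point is only that 10 distinct keys can fill at most 10 of the 200 partitions.

```python
import zlib
from collections import Counter

def assign_partition(key: str, num_partitions: int) -> int:
    """Hash-partition a key; crc32 is a stand-in for Spark's own hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The 10 dates from the original post, used here as the shuffle key.
dates = [f"2024-01-{day:02d}" for day in range(1, 11)]
num_shuffle_partitions = 200  # spark.sql.shuffle.partitions default

# Count how many distinct keys land in each target partition.
keys_per_partition = Counter(
    assign_partition(d, num_shuffle_partitions) for d in dates
)

non_empty = len(keys_per_partition)
empty = num_shuffle_partitions - non_empty
print(f"{non_empty} non-empty partitions, {empty} empty partitions")
```

Empty partitions still become tasks, so they mostly cost scheduling overhead. As far as I know, adaptive query execution can coalesce small shuffle partitions when it is enabled, which usually keeps that overhead low.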

Personal1
New Contributor II

I read a .zip file in Spark and get unreadable data when I run show() on the DataFrame.

When I check the number of partitions using df.rdd.getNumPartitions(), I get 8 (the number of cores I am using). Shouldn't the partition count be just 1, since I am reading a non-splittable compressed file?

When I used only 1 core, however, I got only 1 partition.