<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Partition in Spark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/79580#M35786</link>
    <description>&lt;P&gt;I have follow-up questions here:&lt;BR /&gt;1) The OP mentions 1 GB of data in each folder. So will Spark read ~8 partitions per folder on 8 cores (if available)?&lt;BR /&gt;2) What if I get empty partitions after the shuffle?&lt;/P&gt;</description>
    <pubDate>Sun, 21 Jul 2024 12:27:51 GMT</pubDate>
    <dc:creator>payalbhatia</dc:creator>
    <dc:date>2024-07-21T12:27:51Z</dc:date>
    <item>
      <title>Partition in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/64509#M32592</link>
      <description>&lt;P&gt;Hi Community, I need your help understanding the topics below.&lt;/P&gt;&lt;P&gt;I have a huge transaction file (20 GB): a Parquet file partitioned by the transaction_date column. The data is evenly distributed (no skew). There are 10 days of data, so we have 10 partition folders, each containing 1 GB.&lt;/P&gt;&lt;P&gt;Path = '\FileStore\Nantha\Trx\data\2024-01-01\' .... '\FileStore\Nantha\Trx\data\2024-01-10'&lt;/P&gt;&lt;P&gt;Now, I would like to understand:&lt;/P&gt;&lt;P&gt;1. While reading the file without a where condition (a simple read), how does Spark partition the data and process it in parallel? The default partition size is 128 MB, and I am confused about what this 128 MB partition size actually means.&lt;/P&gt;&lt;P&gt;2. When spark.sql.shuffle.partitions is 200, does that mean 200 partitions of 128 MB each? How are these partitions referenced and calculated? Are they internal?&lt;/P&gt;&lt;P&gt;3. When we issue cache or persist on a DataFrame, will this store the whole DataFrame in MEMORY/DISK, or will it be stored internally as partitions?&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2024 12:31:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/64509#M32592</guid>
      <dc:creator>NanthakumarYoga</dc:creator>
      <dc:date>2024-03-25T12:31:30Z</dc:date>
    </item>
    <item>
      <title>Re: Partition in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/79580#M35786</link>
      <description>&lt;P&gt;I have follow-up questions here:&lt;BR /&gt;1) The OP mentions 1 GB of data in each folder. So will Spark read ~8 partitions per folder on 8 cores (if available)?&lt;BR /&gt;2) What if I get empty partitions after the shuffle?&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jul 2024 12:27:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/79580#M35786</guid>
      <dc:creator>payalbhatia</dc:creator>
      <dc:date>2024-07-21T12:27:51Z</dc:date>
    </item>
    <item>
      <title>Re: Partition in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/92504#M38452</link>
      <description>&lt;P&gt;I read a .zip file in Spark and get unreadable data when I run show() on the DataFrame.&lt;/P&gt;&lt;P&gt;When I check the number of partitions using df.rdd.getNumPartitions(), I get 8 (the number of cores I am using). Shouldn't the partition count be just 1, since I read a non-splittable compressed file?&lt;/P&gt;&lt;P&gt;When I was using only 1 core, I got only 1 partition, though.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Oct 2024 23:04:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/92504#M38452</guid>
      <dc:creator>Personal1</dc:creator>
      <dc:date>2024-10-01T23:04:29Z</dc:date>
    </item>
  </channel>
</rss>

