Partition in Spark
03-25-2024 05:31 AM
Hi Community, I need your help understanding the topics below.
I have a huge transaction file (20 GB) in Parquet format, partitioned by the transaction_date column. The data is evenly distributed (no skew). There are 10 days of data, so we have 10 partition folders, each containing 1 GB.
Path = '\FileStore\Nantha\Trx\data\2024-01-01\' .... '\FileStore\Nantha\Trx\data\2024-01-10'
Now, I would like to understand the following:
1. When reading the file without a where condition (a simple read), how does Spark partition the data and process it in parallel? The default partition size is 128 MB, and this is what confuses me. What does this partition size refer to?
2. When spark.sql.shuffle.partitions is 200, does that mean 200 partitions of 128 MB each? How are these partitions referenced and calculated? Are they internal?
3. When we call cache or persist on a DataFrame, does Spark store the whole DataFrame in MEMORY / DISK as a single unit, or does it store it internally as partitions?
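The arithmetic behind questions 1 and 2 can be sketched in plain Python. This is a rough estimate, not Spark's actual planner: it assumes each date folder holds a single 1 GB file, 8 available cores, and the default values of spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB). The function names are hypothetical, and real Parquet reads also depend on row-group boundaries, so actual counts may differ.

```python
import math

MB = 1024 * 1024


def estimate_max_split_bytes(file_sizes, default_parallelism,
                             max_partition_bytes=128 * MB,
                             open_cost_in_bytes=4 * MB):
    """Approximate the split size Spark targets when reading files.

    Mirrors the commonly described rule:
    min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)).
    """
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def estimate_input_partitions(file_sizes, default_parallelism):
    """Rough count of read partitions: each file is cut into split-sized chunks."""
    split = estimate_max_split_bytes(file_sizes, default_parallelism)
    return sum(math.ceil(size / split) for size in file_sizes)


def average_shuffle_partition_bytes(shuffled_bytes, shuffle_partitions=200):
    """The shuffle setting is a partition COUNT; size is whatever the data divides into."""
    return shuffled_bytes / shuffle_partitions


# Assumption: 10 date folders, one 1 GB file each, 8 cores.
files = [1024 * MB] * 10
read_partitions = estimate_input_partitions(files, default_parallelism=8)
avg_shuffle_mb = average_shuffle_partition_bytes(sum(files)) / MB

print(read_partitions)   # 80 under these assumptions (8 splits per 1 GB file)
print(avg_shuffle_mb)    # 51.2 MB per shuffle partition if all 10 GB is shuffled
```

Under these assumptions, 128 MB acts as an upper bound on how much file data goes into one read partition, while 200 is only the number of partitions produced after a shuffle and says nothing about their size.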
07-21-2024 05:27 AM
I have follow-up questions here:
1) The OP mentions 1 GB of data in each folder. So will Spark read ~8 partitions on 8 cores (if available)?
2) What if I get empty partitions after a shuffle?
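To illustrate how empty shuffle partitions can arise, here is a minimal sketch of hash partitioning in plain Python. It assumes the shuffle key is the OP's transaction_date column with 10 distinct values and 200 shuffle partitions. CRC32 is only a stand-in here: Spark uses its own Murmur3-based hash, so the exact partition ids will differ, but the counting argument is the same.

```python
import zlib
from collections import Counter


def assign_partition(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration; Spark's real hash function is different.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


num_shuffle_partitions = 200
dates = [f"2024-01-{day:02d}" for day in range(1, 11)]  # 10 distinct keys

# Pretend each date has 1000 rows; every row with the same key
# lands in the same partition.
rows_per_partition = Counter(
    assign_partition(date, num_shuffle_partitions)
    for date in dates
    for _ in range(1000)
)

non_empty = len(rows_per_partition)
empty = num_shuffle_partitions - non_empty
print(f"non-empty: {non_empty}, empty: {empty}")
```

With only 10 distinct keys, at most 10 of the 200 partitions can receive rows, so at least 190 stay empty. Empty partitions still become tasks, but those tasks have nothing to process and typically finish almost immediately.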
10-01-2024 04:04 PM
I read a .zip file in Spark and get unreadable data when I run show() on the DataFrame.
When I check the number of partitions using df.rdd.getNumPartitions(), I get 8 (the number of cores I am using). Shouldn't the partition count be just 1, since I read a non-splittable, compressed file?
When I was using only 1 core, though, I got only 1 partition.
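A likely explanation, offered as an assumption rather than a confirmed diagnosis: Spark's file readers rely on Hadoop compression codecs such as gzip and bzip2, and .zip is not among the built-in ones as far as I know. The archive would then be read as raw bytes and split like any uncompressed file, which would account for both the unreadable output and the partition count following the number of cores. The sketch below uses only Python's standard zipfile module to show the difference between an archive's raw bytes and its decompressed content. The member name trx.csv and its rows are hypothetical.

```python
import io
import zipfile

# Build a small zip archive in memory (hypothetical member name and rows).
original_text = "transaction_id,amount\n1,10.5\n2,20.0\n"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("trx.csv", original_text)

raw_bytes = buffer.getvalue()

# Reading the archive as if it were plain text yields the container format,
# starting with the "PK" signature, not the CSV rows.
print(raw_bytes[:2])  # b'PK'

# Opening it as a zip archive recovers the actual content.
with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
    member_names = archive.namelist()
    extracted_text = archive.read("trx.csv").decode("utf-8")

print(member_names)
print(extracted_text)
```

One possible workaround along these lines is to unpack the archive first, or re-compress the data with a codec Spark recognizes, and then point the reader at the extracted files.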