Hi Community, I need your help understanding the topics below.
I have a huge transaction dataset (20 GB) stored as Parquet files, partitioned by the transaction_date column. The data is evenly distributed (no skew). There are 10 days of data, so there are 10 partition folders, each containing about 2 GB.
Path = '/FileStore/Nantha/Trx/data/2024-01-01/' .... '/FileStore/Nantha/Trx/data/2024-01-10'
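For reference, here's a minimal sketch of how I'm reading this data (PySpark; I'm assuming the date folders actually follow Spark's transaction_date=2024-01-01 partition-discovery layout, which the paths above abbreviate):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the base path so Spark discovers all 10 date folders.
# (Assumes the folders are named transaction_date=2024-01-01, ..., so that
# partition discovery surfaces transaction_date as a column.)
df = spark.read.parquet("/FileStore/Nantha/Trx/data/")

df.printSchema()
print(df.rdd.getNumPartitions())  # how many input partitions the scan produced
```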
Now, here is what I would like to understand:
1. When reading the files without a WHERE condition (a simple read), how does Spark partition the data and process it in parallel? The default partition size is said to be 128 MB, and that's what confuses me — what exactly is this 128 MB partition size? (See the input-split sketch below.)
2. When spark.sql.shuffle.partitions is 200, does that mean 200 partitions of 128 MB each? How are these partitions determined and calculated — are they internal to Spark? (See the shuffle sketch below.)
3. When we issue cache() or persist() on a DataFrame, does it store the whole DataFrame in MEMORY / DISK, or does it store it partition by partition internally? (See the cache sketch below.)
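On question 1, here's a sketch of what I think the 128 MB refers to: the spark.sql.files.maxPartitionBytes setting, which caps how many bytes of file data go into one input split (one task). The expected partition count is just my back-of-the-envelope estimate:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The "default partition size" for file reads: each input split (task) gets
# at most this many bytes of file data. Default is 134217728 (128 MB).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

df = spark.read.parquet("/FileStore/Nantha/Trx/data/")

# With ~20 GB of input and 128 MB splits, I'd expect on the order of
# 20 GB / 128 MB = ~160 input partitions (the actual count also depends on
# file sizes and spark.sql.files.openCostInBytes).
print(df.rdd.getNumPartitions())
```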
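On question 2, a sketch showing that spark.sql.shuffle.partitions is a count of output partitions after a wide transformation, not 200 x 128 MB (the groupBy here is just an illustrative wide operation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shuffle partitions are a separate knob from input splits: they only apply
# after a wide operation (groupBy, join, distinct, ...). Default is 200.
print(spark.conf.get("spark.sql.shuffle.partitions"))

df = spark.read.parquet("/FileStore/Nantha/Trx/data/")

# groupBy triggers a shuffle, so the result is laid out across 200 partitions
# regardless of their size -- 200 is a count, not 200 x 128 MB.
agg = df.groupBy("transaction_date").count()
print(agg.rdd.getNumPartitions())  # typically 200; AQE may coalesce it lower
```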
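On question 3, a sketch of cache() vs persist() — as far as I understand, storage happens partition by partition on the executors, not as one monolithic blob (the StorageLevel shown is the DataFrame default, so cache() would behave the same way):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/FileStore/Nantha/Trx/data/")

# For DataFrames, cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK).
# Each executor caches the partitions it computed, spilling individual
# partitions to disk when memory runs out.
df.persist(StorageLevel.MEMORY_AND_DISK)

# Caching is lazy; an action materializes the cached partitions.
df.count()

# The Spark UI "Storage" tab then shows the fraction of partitions cached
# and how many bytes sit in memory vs on disk.
print(df.storageLevel)
```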