Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partition in Spark

NanthakumarYoga
New Contributor II

Hi Community, I need your help understanding the topics below.

I have a huge transaction dataset (20 GB) stored as Parquet, partitioned by the transaction_date column. The data is evenly distributed (no skew). There are 10 days of data, so we have 10 partition folders, each containing 1 GB.

Path = '\FileStore\Nantha\Trx\data\2024-01-01\'    ....  '\FileStore\Nantha\Trx\data\2024-01-10'

Now, I would like to understand the following:

1. While reading the files without a where condition (a simple read), how does Spark partition the data and process it in parallel? (The default partition size is 128 MB.) I am confused about this 128 MB partition size. What is it?

2. When we set shuffle.partitions to 200, does that mean 200 partitions of 128 MB each? How are these partitions determined and calculated? Are they internal ones?

3. When we issue cache or persist on a DataFrame, will it store the whole DataFrame in MEMORY / DISK, or will it store it as partitions internally?

 

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @NanthakumarYoga, Let’s delve into each of your questions about Spark and data partitioning:

  1. Data Partitioning and Parallel Processing:

    • When you read a large Parquet file without any specific where condition (a simple read), Spark automatically partitions the data for parallel processing.
    • By default, Spark caps each input partition at about 128 MB (the spark.sql.files.maxPartitionBytes setting). This means the dataset is divided into smaller chunks (partitions), each at most roughly 128 MB in size.
    • These partitions are distributed across the available processing nodes (executors) in your Spark cluster.
    • The goal is to ensure that each executor processes a manageable portion of the data concurrently, maximizing parallelism and overall performance.
    • With your layout (10 date folders of roughly 1 GB each), Spark will therefore not create one partition per day: each folder is split into roughly 8 input partitions of ~128 MB, around 80 partitions in total (see the first sketch after this list).
  2. Shuffle Partitions and Their Size:

    • When you perform operations that involve data shuffling (such as joins or aggregations), Spark uses shuffle partitions.
    • The spark.sql.shuffle.partitions configuration determines the number of shuffle partitions; by default, it is set to 200 (see the second sketch after this list).
    • These shuffle partitions are not directly related to the 128MB partition size used during simple reads.
    • Instead, shuffle partitions control how data is redistributed during shuffling operations.
    • For example, if you perform a join, Spark will shuffle data between partitions to align matching keys.
    • The shuffle partitions help manage the data movement during these operations, ensuring efficient parallelism.
    • The actual size of each shuffle partition depends on the data distribution and the total amount of data being shuffled.
  3. Cache and Persist:

    • When you cache or persist a DataFrame, Spark stores the intermediate result in memory or on disk.
    • The storage level determines where the data is stored:
      • MEMORY_ONLY: Stores the data in memory as deserialized Java objects.
      • MEMORY_AND_DISK: Stores the data in memory, and if it doesn’t fit, spills to disk.
      • Other storage levels include serialization options (MEMORY_ONLY_SER, MEMORY_AND_DISK_SER) and disk-only storage (DISK_ONLY).
    • When you cache or persist a DataFrame, it doesn’t store the entire DataFrame as a whole. Instead:
      • Each partition of the DataFrame is cached independently.
      • If a partition doesn’t fit in memory, it spills to disk (based on the storage level).
      • Cached partitions are reused across subsequent actions on the same DataFrame.
    • Caching helps avoid recomputing expensive transformations and improves performance for iterative or interactive workloads (see the third sketch after this list).
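
To illustrate the first point, here is a minimal PySpark sketch (first sketch), assuming the spark session available in a Databricks notebook and a path adapted from the question's example layout; since the date folders are plain directory names rather than transaction_date=... folders, a glob is used:

    # Input partition size for file sources is governed by this setting (default is 128 MB).
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    # Read every date folder (path and glob adapted from the question's example layout).
    df = spark.read.parquet("/FileStore/Nantha/Trx/data/*")

    # Each ~1 GB folder is split into roughly 8 input splits of ~128 MB,
    # so the 10 folders yield on the order of 80 input partitions.
    print(df.rdd.getNumPartitions())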
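
For the second point, a short sketch of how spark.sql.shuffle.partitions behaves on a wide transformation (second sketch; the transaction_date and amount column names are assumed for illustration):

    # Number of partitions produced by shuffles (joins, groupBy, ...); the default is 200.
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    # A groupBy triggers a shuffle: rows are redistributed by key across the shuffle partitions.
    agg = df.groupBy("transaction_date").sum("amount")

    # The result has 200 partitions regardless of the ~80 input partitions; with
    # Adaptive Query Execution enabled, Spark may coalesce small ones at runtime.
    print(agg.rdd.getNumPartitions())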
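
And for the third point, a sketch of persisting the same DataFrame with an explicit storage level (third sketch); the Storage tab of the Spark UI then lists the cached partitions and shows whether any spilled to disk:

    from pyspark.storagelevel import StorageLevel

    # DataFrame.cache() defaults to a memory-and-disk level; persist() lets you choose explicitly.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # An action materializes the cache, partition by partition.
    df.count()

    # Release the cached partitions once they are no longer needed.
    df.unpersist()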

In summary:

  • Spark partitions data for parallel processing during reads.
  • Shuffle partitions are used during data shuffling operations.
  • Caching/persisting stores individual partitions in memory or on disk, not the entire DataFrame.

Feel free to explore these concepts further, and let me know if you have any more questions! 😊

 

payalbhatia
New Contributor II

I have follow-up questions here:
1) The OP mentions about 1 GB of data in each folder. So Spark will read it as ~8 partitions on 8 cores (if they are available)?
2) What if I get empty partitions after the shuffle?
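
A quick sketch to check both points (the path and the transaction_date column are taken from the example in this thread; actual counts depend on your cluster and on whether Adaptive Query Execution is enabled):

    # ~1 GB folder / 128 MB maxPartitionBytes -> roughly 8 input partitions,
    # which can run concurrently on up to 8 cores.
    one_day = spark.read.parquet("/FileStore/Nantha/Trx/data/2024-01-01/")
    print(one_day.rdd.getNumPartitions())

    # With 200 shuffle partitions and only a handful of distinct keys, most output
    # partitions hold no rows; they add scheduling overhead but move no data.
    # Adaptive Query Execution (spark.sql.adaptive.enabled) coalesces them at runtime.
    rows_per_partition = one_day.groupBy("transaction_date").count().rdd.glom().map(len).collect()
    print(sum(1 for n in rows_per_partition if n == 0), "empty partitions")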
