<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Partition in Spark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/79580#M35786</link>
    <description>&lt;P&gt;I have follow-up questions here:&lt;BR /&gt;1) The OP mentions 1 GB of data in each folder. So will Spark read ~8 partitions per folder on 8 cores (if available)?&lt;BR /&gt;2) What if I get empty partitions after the shuffle?&lt;/P&gt;</description>
    <pubDate>Sun, 21 Jul 2024 12:27:51 GMT</pubDate>
    <dc:creator>payalbhatia</dc:creator>
    <dc:date>2024-07-21T12:27:51Z</dc:date>
    <item>
      <title>Partition in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/64509#M32592</link>
      <description>&lt;P&gt;Hi Community, I need your help understanding the topics below.&lt;/P&gt;&lt;P&gt;I have a huge transaction file (20 GB): a Parquet file partitioned by the transaction_date column. The data is evenly distributed (no skew). There are 10 days of data, so we have 10 partition folders, each containing 1 GB.&lt;/P&gt;&lt;P&gt;Path = '\FileStore\Nantha\Trx\data\2024-01-01\' .... '\FileStore\Nantha\Trx\data\2024-01-10'&lt;/P&gt;&lt;P&gt;Now, I would like to understand:&lt;/P&gt;&lt;P&gt;1. While reading the file without a where condition (a simple read), how does Spark partition the data and process it in parallel? The default partition size is 128 MB, and I am confused about what this 128 MB partition size actually means.&lt;/P&gt;&lt;P&gt;2. When spark.sql.shuffle.partitions is 200, does that mean 200 partitions of 128 MB each? How are these partitions referenced and calculated? Are they internal?&lt;/P&gt;&lt;P&gt;3. When we issue cache or persist on a DataFrame, will this store the whole DataFrame in MEMORY/DISK, or will it be stored internally as partitions?&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2024 12:31:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/64509#M32592</guid>
      <dc:creator>NanthakumarYoga</dc:creator>
      <dc:date>2024-03-25T12:31:30Z</dc:date>
    </item>
    <item>
      <title>Re: Partition in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/79580#M35786</link>
      <description>&lt;P&gt;I have follow-up questions here:&lt;BR /&gt;1) The OP mentions 1 GB of data in each folder. So will Spark read ~8 partitions per folder on 8 cores (if available)?&lt;BR /&gt;2) What if I get empty partitions after the shuffle?&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jul 2024 12:27:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/79580#M35786</guid>
      <dc:creator>payalbhatia</dc:creator>
      <dc:date>2024-07-21T12:27:51Z</dc:date>
    </item>
    <item>
      <title>Re: Partition in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/92504#M38452</link>
      <description>&lt;P&gt;I read a .zip file in Spark and get unreadable data when I run show() on the DataFrame.&lt;/P&gt;&lt;P&gt;When I check the number of partitions using df.rdd.getNumPartitions(), I get 8 (the number of cores I am using). Shouldn't the partition count be just 1, since I read a non-splittable compressed file?&lt;/P&gt;&lt;P&gt;When I was using only 1 core, I got only 1 partition, though.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Oct 2024 23:04:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partition-in-spark/m-p/92504#M38452</guid>
      <dc:creator>Personal1</dc:creator>
      <dc:date>2024-10-01T23:04:29Z</dc:date>
    </item>
  </channel>
</rss>

