<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: [SOLVED]  maxPartitionBytes ignored? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12669#M7441</link>
    <description>&lt;P&gt;AQE only kicks in when you are actually doing transformations (shuffle/broadcast), and then it tries to optimize the partition size:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe#dynamically-coalesce-partitions" alt="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe#dynamically-coalesce-partitions" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe#dynamically-coalesce-partitions&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;AFAIK the read partition size is indeed defined by maxPartitionBytes.&lt;/P&gt;&lt;P&gt;Now, I do recall a topic on Stack Overflow where someone asks a similar question.&lt;/P&gt;&lt;P&gt;There they mention that the compression codec also matters.&lt;/P&gt;&lt;P&gt;Chances are you use Snappy compression. If that is the case, the partition size might be defined by the row group size of the Parquet files.&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable" alt="https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable" target="_blank"&gt;https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/" alt="http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/" target="_blank"&gt;http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;David Vrba also mentions the compression used:&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions" alt="https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions"
target="_blank"&gt;https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Sat, 23 Oct 2021 09:36:35 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-10-23T09:36:35Z</dc:date>
    <item>
      <title>[SOLVED]  maxPartitionBytes ignored?</title>
      <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12664#M7436</link>
      <description>&lt;P&gt;Hello all!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm running a simple read noop query where I read a specific partition of a Delta table that looks like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2378i000D140D998E2AFD/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;With the default configuration, I read the data in 12 partitions, which makes sense as the files that are larger than 128 MB are split.&lt;/P&gt;&lt;P&gt;When I set "spark.sql.files.maxPartitionBytes" (or "spark.files.maxPartitionBytes") to 64 MB, I do read with 20 partitions as expected, &lt;B&gt;THOUGH&lt;/B&gt; the extra partitions are empty (or hold only a few kilobytes).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have tested with "spark.sql.adaptive.enabled" set to true and false without any change in the behaviour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any thoughts on why this is happening and how to force Spark to read in smaller partitions?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you in advance for your help!&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 13:17:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12664#M7436</guid>
      <dc:creator>pantelis_mare</dc:creator>
      <dc:date>2021-10-22T13:17:25Z</dc:date>
    </item>
    <item>
      <title>Re: [SOLVED]  maxPartitionBytes ignored?</title>
      <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12665#M7437</link>
      <description>&lt;P&gt;How did you determine the number of partitions read and the size of these partitions?&lt;/P&gt;&lt;P&gt;I ask because if you first read the data and then immediately wrote it to another Delta table, auto optimize on Delta Lake also comes into play, and it tries to write 128 MB files.&lt;/P&gt;&lt;P&gt;(spark.databricks.delta.autoCompact.maxFileSize)&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 14:35:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12665#M7437</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-22T14:35:39Z</dc:date>
    </item>
    <item>
      <title>Re: [SOLVED]  maxPartitionBytes ignored?</title>
      <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12666#M7438</link>
      <description>&lt;P&gt;AQE doesn't affect partitioning at read time, only at shuffle time. It would be better to run OPTIMIZE on the Delta table, which compacts the files to approximately 1 GB each and should give better read performance.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 19:56:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12666#M7438</guid>
      <dc:creator>ashish1</dc:creator>
      <dc:date>2021-10-22T19:56:56Z</dc:date>
    </item>
    <item>
      <title>Re: [SOLVED]  maxPartitionBytes ignored?</title>
      <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12667#M7439</link>
      <description>&lt;P&gt;Hello Werners,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm looking at the input size of each partition on the stage page of the Spark UI. As I said, I did a noop operation, so there is no actual writing. My goal is to control the partition size at read time, which the conf I'm playing with is supposed to do.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 21:48:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12667#M7439</guid>
      <dc:creator>pantelis_mare</dc:creator>
      <dc:date>2021-10-22T21:48:19Z</dc:date>
    </item>
    <item>
      <title>Re: [SOLVED]  maxPartitionBytes ignored?</title>
      <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12668#M7440</link>
      <description>&lt;P&gt;Hello Ashish,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I was just wondering whether AQE might change the expected behaviour. As stated before, my goal here is to control the partition size at read time, not to optimise my read time.&lt;/P&gt;&lt;P&gt;Why does it correctly break the 180 MB file in 2 when the limit is 128 MB, but not the 108 MB files when the limit is 64 MB?&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 21:51:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12668#M7440</guid>
      <dc:creator>pantelis_mare</dc:creator>
      <dc:date>2021-10-22T21:51:01Z</dc:date>
    </item>
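    <!--
A minimal sketch of why both files in the question above are planned as two byte ranges. It assumes the effective split size equals the configured spark.sql.files.maxPartitionBytes (Spark may choose a smaller value on its own) and that ranges are cut purely by byte offset. The helper name byte_range_splits is hypothetical and is not a Spark API.

```python
MB = 1024 * 1024


def byte_range_splits(file_len, max_split_bytes):
    """Return (offset, length) byte ranges that cover a file of file_len bytes."""
    return [
        (offset, min(max_split_bytes, file_len - offset))
        for offset in range(0, file_len, max_split_bytes)
    ]


# The two cases from the question: 180 MB at a 128 MB limit, 108 MB at a 64 MB limit.
splits_128 = byte_range_splits(180 * MB, 128 * MB)
splits_64 = byte_range_splits(108 * MB, 64 * MB)

print([(o // MB, n // MB) for o, n in splits_128])  # [(0, 128), (128, 52)]
print([(o // MB, n // MB) for o, n in splits_64])   # [(0, 64), (64, 44)]
```

Under these assumptions the planned ranges look the same in both cases. How many bytes each task really reads can still differ, because a columnar file can only be read in whole units.
    -->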
    <item>
      <title>Re: [SOLVED]  maxPartitionBytes ignored?</title>
      <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12669#M7441</link>
      <description>&lt;P&gt;AQE only kicks in when you are actually doing transformations (shuffle/broadcast), and then it tries to optimize the partition size:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe#dynamically-coalesce-partitions" alt="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe#dynamically-coalesce-partitions" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe#dynamically-coalesce-partitions&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;AFAIK the read partition size is indeed defined by maxPartitionBytes.&lt;/P&gt;&lt;P&gt;Now, I do recall a topic on Stack Overflow where someone asks a similar question.&lt;/P&gt;&lt;P&gt;There they mention that the compression codec also matters.&lt;/P&gt;&lt;P&gt;Chances are you use Snappy compression. If that is the case, the partition size might be defined by the row group size of the Parquet files.&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable" alt="https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable" target="_blank"&gt;https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/" alt="http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/" target="_blank"&gt;http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;David Vrba also mentions the compression used:&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions" alt="https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions"
target="_blank"&gt;https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 23 Oct 2021 09:36:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12669#M7441</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-23T09:36:35Z</dc:date>
    </item>
    <item>
      <title>Re: [SOLVED]  maxPartitionBytes ignored?</title>
      <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12670#M7442</link>
      <description>&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You have a point there regarding Parquet. I just checked: it does read a separate row group in each task. The thing is that the row groups are unbalanced, so the second task gets just a few KB of data.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Case closed, kudos @Werner Stinckens&lt;/P&gt;</description>
      <pubDate>Mon, 25 Oct 2021 15:42:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12670#M7442</guid>
      <dc:creator>pantelis_mare</dc:creator>
      <dc:date>2021-10-25T15:42:31Z</dc:date>
    </item>
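    <!--
The finding above (one task reads almost everything, the other only a few KB) can be illustrated with a small simulation. This is a sketch under stated assumptions, not Spark or Parquet code: it assumes a row group is read by whichever byte-range split contains the row group's midpoint, and the row group sizes (107 MB and 1 MB) are hypothetical values chosen to mimic an unbalanced 108 MB file. The function name bytes_read_per_split is made up for this example.

```python
MB = 1024 * 1024


def bytes_read_per_split(row_groups, split_size):
    """Sum the row-group bytes that each byte-range split ends up reading.

    row_groups is a list of (start_offset, length) pairs in bytes.
    Assumption: a row group goes to the split whose range contains its midpoint.
    """
    file_len = max(start + length for start, length in row_groups)
    n_splits = (file_len + split_size - 1) // split_size  # ceiling division
    totals = [0] * n_splits
    for start, length in row_groups:
        midpoint = start + length // 2
        totals[midpoint // split_size] += length
    return totals


# Hypothetical layout: one 107 MB row group followed by a 1 MB row group.
row_groups = [(0, 107 * MB), (107 * MB, 1 * MB)]

at_128 = bytes_read_per_split(row_groups, 128 * MB)
at_64 = bytes_read_per_split(row_groups, 64 * MB)

print([b // MB for b in at_128])  # [108]
print([b // MB for b in at_64])   # [107, 1]
```

With this layout, lowering the limit to 64 MB adds a second task, but that task only receives the small trailing row group, which matches the behaviour described in the thread.
    -->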
    <item>
      <title>Re: [SOLVED]  maxPartitionBytes ignored?</title>
      <link>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12671#M7443</link>
      <description>&lt;P&gt;Hi @Pantelis Maroudis&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks for your reply. If you think @Werner Stinckens's reply helped you solve this issue, then please mark it as the best answer to move it to the top of the thread.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Oct 2021 20:45:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12671#M7443</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-10-26T20:45:41Z</dc:date>
    </item>
  </channel>
</rss>

