<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Parquet partitionBy - date column to nested folders in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parquet-partitionby-date-column-to-nested-folders/m-p/27582#M19447</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Unfortunately the format has been broken on saving &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here is the structure as-is example:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0693f000007OrnrAAC"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2507i52892AB8967C82B2/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693f000007OrnrAAC" alt="0693f000007OrnrAAC" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here is the desired structure example:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0693f000007OrnqAAC"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2506i7AA319FE54AE76FD/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693f000007OrnqAAC" alt="0693f000007OrnqAAC" /&gt;&lt;/span&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 11 Nov 2019 14:17:10 GMT</pubDate>
    <dc:creator>1stcommander</dc:creator>
    <dc:date>2019-11-11T14:17:10Z</dc:date>
    <item>
      <title>Parquet partitionBy - date column to nested folders</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-partitionby-date-column-to-nested-folders/m-p/27581#M19446</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;when writing a DataFrame to parquet using partitionBy(&amp;lt;date column&amp;gt;), the resulting folder structure looks like this:&lt;/P&gt;
&lt;P&gt;root&lt;/P&gt;
&lt;P&gt;|----------------- day1&lt;/P&gt;
&lt;P&gt;|----------------- day2&lt;/P&gt;
&lt;P&gt;|----------------- day3&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Is it possible to create a structure like to following without explicitely creating the partitioning columns:&lt;/P&gt;
&lt;P&gt;root&lt;/P&gt;
&lt;P&gt;|----- year1&lt;/P&gt;
&lt;P&gt;|-----month1&lt;/P&gt;
&lt;P&gt;|----- day1&lt;/P&gt;
&lt;P&gt;|----- .... &lt;/P&gt;
&lt;P&gt;|----- year2&lt;/P&gt;
&lt;P&gt;|----- month&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I know that i could achieve it with something like&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df
.withColumn("year", year(col("date_col"))).withColumn("month", month(col("date_col"))).withColumn("day", dayofmonth(col("date_col"))).withColumn("hour", hour(col("date_col"))).partitionBy("year","month","day","hour")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;taken from (https://stackoverflow.com/questions/52527888/spark-partition-data-writing-by-timestamp), &lt;/P&gt;
&lt;P&gt;but when you do like this you also have to use the "virtual" columns when querying from the files in SparkSQL afterwards in order to profit from partition pruning. (In the example, you have to use "WHERE year = 2017 AND month = 2 " - if you use "WHERE date_col &amp;gt;= to_date('2017-02-01') AND date_col &amp;lt;= to_date('2017-03-01')" it doesn`t use partition pruning.&lt;/P&gt;
&lt;P&gt;I'm wondering if there is some functionality that i currently just do not know about that can &lt;/P&gt;
&lt;P&gt;a) automatically create the nested folder structure&lt;/P&gt;
&lt;P&gt;b) also use this for partition pruning while querying&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thank you&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2019 14:10:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-partitionby-date-column-to-nested-folders/m-p/27581#M19446</guid>
      <dc:creator>1stcommander</dc:creator>
      <dc:date>2019-11-11T14:10:40Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet partitionBy - date column to nested folders</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-partitionby-date-column-to-nested-folders/m-p/27582#M19447</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Unfortunately the format has been broken on saving &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here is the structure as-is example:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0693f000007OrnrAAC"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2507i52892AB8967C82B2/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693f000007OrnrAAC" alt="0693f000007OrnrAAC" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here is the desired structure example:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0693f000007OrnqAAC"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2506i7AA319FE54AE76FD/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693f000007OrnqAAC" alt="0693f000007OrnqAAC" /&gt;&lt;/span&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2019 14:17:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-partitionby-date-column-to-nested-folders/m-p/27582#M19447</guid>
      <dc:creator>1stcommander</dc:creator>
      <dc:date>2019-11-11T14:17:10Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet partitionBy - date column to nested folders</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-partitionby-date-column-to-nested-folders/m-p/27583#M19448</link>
      <description>&lt;P&gt;Hey @1stcommander​&amp;nbsp;&lt;/P&gt;&lt;P&gt;You'll have to create those columns yourself. If it's something you will have to do often you could always write a function. In any case, imho it's not that much work. &lt;/P&gt;&lt;P&gt;Im not sure what your problem is with the partition pruning. It's almost as if you're saying you want the exact thing you said you dont want. &lt;/P&gt;&lt;P&gt;Good luck&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 13 Nov 2019 14:09:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-partitionby-date-column-to-nested-folders/m-p/27583#M19448</guid>
      <dc:creator>Saphira</dc:creator>
      <dc:date>2019-11-13T14:09:41Z</dc:date>
    </item>
  </channel>
</rss>

