<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: parquet file to include partitioned column in file in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32478#M23672</link>
    <description>&lt;P&gt;Thanks for the reply. Can you suggest how consumers using custom code to read the files can get the partition column?&lt;/P&gt;&lt;P&gt;Presently the consumer gets a list of all files in the folder, filters out the files already processed, and then reads each new file with&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.format('parquet').load(filePath)&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Tue, 28 Dec 2021 13:05:57 GMT</pubDate>
    <dc:creator>guruv</dc:creator>
    <dc:date>2021-12-28T13:05:57Z</dc:date>
    <item>
      <title>parquet file to include partitioned column in file</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32476#M23670</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a daily scheduled job which processes the data and writes it as parquet files in a specific folder structure like root_folder/{CountryCode}/parquetfiles, where each day the job writes new data for a country code under that country's folder.&lt;/P&gt;&lt;P&gt;I am trying to achieve this by using&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dataframe.write.partitionBy("countryCode").parquet(root_Folder)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This creates a folder structure like&lt;/P&gt;&lt;P&gt;root_folder/countryCode=x/part1-snappy.parquet&lt;/P&gt;&lt;P&gt;root_folder/countryCode=x/part2-snappy.parquet&lt;/P&gt;&lt;P&gt;root_folder/countryCode=y/part1-snappy.parquet&lt;/P&gt;&lt;P&gt;but the countryCode column is removed from the parquet file.&lt;/P&gt;&lt;P&gt;In my case the parquet files are to be read by external consumers, and they expect the countryCode column in the file.&lt;/P&gt;&lt;P&gt;Is there an option to have the column in the file and also in the folder path?&lt;/P&gt;</description>
      <pubDate>Tue, 28 Dec 2021 06:22:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32476#M23670</guid>
      <dc:creator>guruv</dc:creator>
      <dc:date>2021-12-28T06:22:53Z</dc:date>
    </item>
    <item>
      <title>Re: parquet file to include partitioned column in file</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32477#M23671</link>
      <description>&lt;P&gt;Most external consumers will read the partition as a column when they are properly configured (for example Azure Data Factory or Power BI).&lt;/P&gt;&lt;P&gt;The only way around it is to duplicate the column under another name (you cannot use the same name, as it will generate conflicts in appends and reads from many clients):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;.withColumn("foo_", col("foo"))&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 28 Dec 2021 10:45:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32477#M23671</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-12-28T10:45:32Z</dc:date>
    </item>
    <item>
      <title>Re: parquet file to include partitioned column in file</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32478#M23672</link>
      <description>&lt;P&gt;Thanks for the reply. Can you suggest how consumers using custom code to read the files can get the partition column?&lt;/P&gt;&lt;P&gt;Presently the consumer gets a list of all files in the folder, filters out the files already processed, and then reads each new file with&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.format('parquet').load(filePath)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 28 Dec 2021 13:05:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32478#M23672</guid>
      <dc:creator>guruv</dc:creator>
      <dc:date>2021-12-28T13:05:57Z</dc:date>
    </item>
    <item>
      <title>Re: parquet file to include partitioned column in file</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32479#M23673</link>
      <description>&lt;UL&gt;&lt;LI&gt;Please try to add .option("mergeSchema", "true")&lt;/LI&gt;&lt;LI&gt;In filePath, just specify the main top-level folder with the partitions (the root folder of the parquet dataset)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Here is the official documentation about partition discovery: &lt;A href="https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#partition-discovery" target="_blank"&gt;https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#partition-discovery&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Dec 2021 09:58:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32479#M23673</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-12-29T09:58:09Z</dc:date>
    </item>
    <item>
      <title>Re: parquet file to include partitioned column in file</title>
      <link>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32480#M23674</link>
      <description>&lt;P&gt;Thanks, I will try it and check back in case of any other issues.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Dec 2021 19:28:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parquet-file-to-include-partitioned-column-in-file/m-p/32480#M23674</guid>
      <dc:creator>guruv</dc:creator>
      <dc:date>2021-12-29T19:28:53Z</dc:date>
    </item>
  </channel>
</rss>

