<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic running a query against multiple parquet files from a folder in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/running-a-query-against-multiple-parquet-files-from-a-folder/m-p/75730#M1416</link>
    <description>&lt;P&gt;I am running a query against multiple parquet files:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;SELECT
SUM(CASE WHEN match_result.year_incorporated IS NOT NULL AND match_result.year_incorporated != '' THEN 1 ELSE 0 END)
FROM 
parquet.`s3://folder_path/*`&lt;/LI-CODE&gt;&lt;P&gt;For some files, the field `year_incorporated` has a string value, and for other files the entire field is null. I am getting this error for a file in which all values are null:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Error while reading file s3://file_path.PARQUET. Schema conversion error: cannot convert Parquet type INT32 to Photon type string(0)&lt;/LI-CODE&gt;&lt;P&gt;How can I fix this issue?&lt;/P&gt;</description>
    <pubDate>Tue, 25 Jun 2024 15:13:21 GMT</pubDate>
    <dc:creator>Shaimaa</dc:creator>
    <dc:date>2024-06-25T15:13:21Z</dc:date>
    <item>
      <title>running a query against multiple parquet files from a folder</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/running-a-query-against-multiple-parquet-files-from-a-folder/m-p/75730#M1416</link>
      <description>&lt;P&gt;I am running a query against multiple parquet files:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;SELECT
SUM(CASE WHEN match_result.year_incorporated IS NOT NULL AND match_result.year_incorporated != '' THEN 1 ELSE 0 END)
FROM 
parquet.`s3://folder_path/*`&lt;/LI-CODE&gt;&lt;P&gt;For some files, the field `year_incorporated` has a string value, and for other files the entire field is null. I am getting this error for a file in which all values are null:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Error while reading file s3://file_path.PARQUET. Schema conversion error: cannot convert Parquet type INT32 to Photon type string(0)&lt;/LI-CODE&gt;&lt;P&gt;How can I fix this issue?&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 15:13:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/running-a-query-against-multiple-parquet-files-from-a-folder/m-p/75730#M1416</guid>
      <dc:creator>Shaimaa</dc:creator>
      <dc:date>2024-06-25T15:13:21Z</dc:date>
    </item>
    <item>
      <title>Re: running a query against multiple parquet files from a folder</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/running-a-query-against-multiple-parquet-files-from-a-folder/m-p/75774#M1417</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/108540"&gt;@Shaimaa&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;A column type mismatch between the files could be the cause here.&lt;BR /&gt;For example, if column 'xyz' is of type INTEGER in one file and of type STRING in another, Spark will raise a schema conversion error.&lt;BR /&gt;The article linked below explains the issue in more detail; however, the best solution would be to fix the column types in the source files or to change the file format.&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce" target="_blank" rel="noopener"&gt;https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jun 2024 06:30:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/running-a-query-against-multiple-parquet-files-from-a-folder/m-p/75774#M1417</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2024-06-26T06:30:45Z</dc:date>
    </item>
  </channel>
</rss>

