<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Issue in reading parquet file in pyspark databricks. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31504#M22942</link>
    <description>&lt;P&gt;One of the source systems occasionally generates a Parquet file that is only 220 KB in size.&lt;/P&gt;&lt;P&gt;Reading it fails with:&lt;/P&gt;&lt;P&gt;"java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet&lt;/P&gt;&lt;P&gt;Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_32);&lt;/P&gt;&lt;P&gt;"&lt;/P&gt;&lt;P&gt;I tried supplying an explicit schema together with the mergeSchema option:&lt;/P&gt;&lt;P&gt;df = spark.read.options(mergeSchema=True).schema(mdd_schema_struct).parquet(target)&lt;/P&gt;&lt;P&gt;With this, the file can be read and displayed, but running count or a merge on it fails with:&lt;/P&gt;&lt;P&gt;"Caused by: java.lang.RuntimeException: Illegal row group of 0 rows"&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Does anyone know what the issue could be?&lt;/P&gt;</description>
    <pubDate>Mon, 17 Jan 2022 15:49:47 GMT</pubDate>
    <dc:creator>irfanaziz</dc:creator>
    <dc:date>2022-01-17T15:49:47Z</dc:date>
    <item>
      <title>Issue in reading parquet file in pyspark databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31504#M22942</link>
      <description>&lt;P&gt;One of the source systems occasionally generates a Parquet file that is only 220 KB in size.&lt;/P&gt;&lt;P&gt;Reading it fails with:&lt;/P&gt;&lt;P&gt;"java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet&lt;/P&gt;&lt;P&gt;Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_32);&lt;/P&gt;&lt;P&gt;"&lt;/P&gt;&lt;P&gt;I tried supplying an explicit schema together with the mergeSchema option:&lt;/P&gt;&lt;P&gt;df = spark.read.options(mergeSchema=True).schema(mdd_schema_struct).parquet(target)&lt;/P&gt;&lt;P&gt;With this, the file can be read and displayed, but running count or a merge on it fails with:&lt;/P&gt;&lt;P&gt;"Caused by: java.lang.RuntimeException: Illegal row group of 0 rows"&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Does anyone know what the issue could be?&lt;/P&gt;</description>
      <pubDate>Mon, 17 Jan 2022 15:49:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31504#M22942</guid>
      <dc:creator>irfanaziz</dc:creator>
      <dc:date>2022-01-17T15:49:47Z</dc:date>
    </item>
    <item>
      <title>Re: Issue in reading parquet file in pyspark databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31505#M22943</link>
      <description>&lt;P&gt;It seems the file is corrupted. You may be able to skip such files by setting:&lt;/P&gt;&lt;P&gt;spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You can also check this setting:&lt;/P&gt;&lt;P&gt;sqlContext.setConf("spark.sql.parquet.filterPushdown","false")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Alternatively, register the files as a table (pointing to the location of the files) with the correct schema set, and then try running:&lt;/P&gt;&lt;P&gt;%sql&lt;/P&gt;&lt;P&gt;MSCK REPAIR TABLE table_name&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html" target="test_blank"&gt;https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 17 Jan 2022 16:14:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31505#M22943</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-01-17T16:14:43Z</dc:date>
    </item>
    <item>
      <title>Re: Issue in reading parquet file in pyspark databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31506#M22944</link>
      <description>&lt;P&gt;Yes, I had to use the badRows option, which puts the bad files in a given path.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Feb 2022 14:52:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31506#M22944</guid>
      <dc:creator>irfanaziz</dc:creator>
      <dc:date>2022-02-08T14:52:28Z</dc:date>
    </item>
    <item>
      <title>Re: Issue in reading parquet file in pyspark databricks.</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31507#M22945</link>
      <description>&lt;P&gt;@nafri A - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek's answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Feb 2022 16:13:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-in-reading-parquet-file-in-pyspark-databricks/m-p/31507#M22945</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-09T16:13:04Z</dc:date>
    </item>
  </channel>
</rss>

