<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/57974#M30974</link>
    <description>&lt;P&gt;I assume that if a &lt;STRONG&gt;.gz&lt;/STRONG&gt; file is purposely renamed to &lt;STRONG&gt;.GZ&lt;/STRONG&gt;, it should still be treated as a valid gzip file, because the &lt;STRONG&gt;.GZ&lt;/STRONG&gt; file still contains compressed data that is perfectly valid.&lt;/P&gt;</description>
    <pubDate>Sun, 21 Jan 2024 07:05:12 GMT</pubDate>
    <dc:creator>hari-prasad</dc:creator>
    <dc:date>2024-01-21T07:05:12Z</dc:date>
    <item>
      <title>Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/57883#M30939</link>
      <description>&lt;P&gt;If the file is named file_name.sv.gz (lower-case extension), it is read fine; if it is named file_name.sv.GZ (upper-case extension), the data is read as corrupted, i.e. Spark simply reads the compressed file as-is.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hprasad_0-1705667590987.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/5874iDEBACD358DB0ACC6/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="hprasad_0-1705667590987.png" alt="hprasad_0-1705667590987.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jan 2024 12:34:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/57883#M30939</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2024-01-19T12:34:59Z</dc:date>
    </item>
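    <!--
    The behavior reported above can be illustrated outside Spark: the bytes are valid gzip no matter how the file is named, so only an extension-based codec lookup treats the two names differently. A minimal sketch in plain Python (not Spark; the file names are hypothetical):

    ```python
    import gzip
    import tempfile
    from pathlib import Path

    # Write identical gzip bytes under lower- and upper-case extensions.
    tmp = Path(tempfile.mkdtemp())
    payload = b"id,value\n1,foo\n2,bar\n"
    compressed = gzip.compress(payload)
    for name in ("data.csv.gz", "data.csv.GZ"):
        (tmp / name).write_bytes(compressed)

    # Decompressing by content (not by name) recovers the same data from both,
    # which is why the upper-case file is not actually "corrupted".
    for name in ("data.csv.gz", "data.csv.GZ"):
        assert gzip.decompress((tmp / name).read_bytes()) == payload
    ```
    -->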
    <item>
      <title>Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/57887#M30942</link>
      <description>&lt;P&gt;I don't think .GZ (upper case) is a valid file extension. Most systems I have seen compress files with a .gz (lower case) extension.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jan 2024 14:33:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/57887#M30942</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2024-01-19T14:33:32Z</dc:date>
    </item>
    <item>
      <title>Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/57974#M30974</link>
      <description>&lt;P&gt;I assume that if a &lt;STRONG&gt;.gz&lt;/STRONG&gt; file is purposely renamed to &lt;STRONG&gt;.GZ&lt;/STRONG&gt;, it should still be treated as a valid gzip file, because the &lt;STRONG&gt;.GZ&lt;/STRONG&gt; file still contains compressed data that is perfectly valid.&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jan 2024 07:05:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/57974#M30974</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2024-01-21T07:05:12Z</dc:date>
    </item>
    <item>
      <title>Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58176#M31035</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Agreed, but Spark infers the compression codec from the filename, and it cannot infer the compression from a .GZ extension. You can read more about this in the article below:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;A href="https://aws.plainenglish.io/demystifying-apache-spark-quirks-2c91ba2d3978" target="_blank" rel="noopener"&gt;https://aws.plainenglish.io/demystifying-apache-spark-quirks-2c91ba2d3978&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2024 15:57:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58176#M31035</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2024-01-22T15:57:23Z</dc:date>
    </item>
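    <!--
    The case-sensitivity discussed above comes from the extension-to-codec lookup: the suffix comparison is exact, so ".GZ" finds no codec and the bytes are passed through undecoded. A rough Python model of that exact-match lookup (the mapping below is illustrative, not Hadoop's actual table):

    ```python
    # Illustrative model of an extension-keyed codec lookup with an
    # exact (case-sensitive) suffix match, as the thread describes.
    CODEC_BY_EXTENSION = {".gz": "GzipCodec", ".bz2": "BZip2Codec"}

    def find_codec(filename: str):
        """Return the codec whose extension the filename ends with, else None."""
        for ext, codec in CODEC_BY_EXTENSION.items():
            if filename.endswith(ext):  # case-sensitive comparison
                return codec
        return None

    print(find_codec("events.csv.gz"))  # matches GzipCodec
    print(find_codec("events.csv.GZ"))  # no match: file is read as plain bytes
    ```
    -->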
    <item>
      <title>Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58182#M31037</link>
      <description>&lt;P&gt;Yup, Spark does infer it from the filename; I have been through the Spark code on GitHub.&lt;/P&gt;&lt;P&gt;The article also refers to the internal code of the Spark library.&lt;/P&gt;&lt;P&gt;I assume we could add an exception to handle a .GZ file as gzip by tweaking the Spark libraries.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2024 16:27:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58182#M31037</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2024-01-22T16:27:35Z</dc:date>
    </item>
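    <!--
    Short of patching Spark or the underlying libraries, a practical workaround is to normalize the extensions to lower case before pointing Spark at the directory. A minimal sketch on a local path (plain Python; the directory and file names are hypothetical):

    ```python
    import gzip
    import tempfile
    from pathlib import Path

    def normalize_gz_extensions(directory: Path) -> None:
        """Rename *.GZ files to *.gz so extension-based codec inference succeeds."""
        for path in directory.iterdir():
            if path.suffix == ".GZ":
                path.rename(path.with_suffix(".gz"))

    # Demo: create an upper-case-extension gzip file, then normalize it.
    tmp = Path(tempfile.mkdtemp())
    (tmp / "part.csv.GZ").write_bytes(gzip.compress(b"a,b\n1,2\n"))
    normalize_gz_extensions(tmp)
    assert (tmp / "part.csv.gz").exists()
    ```

    On cloud or DBFS storage the same idea applies, but the rename would go through the corresponding filesystem utilities rather than pathlib.
    -->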
    <item>
      <title>Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58183#M31038</link>
      <description>&lt;P&gt;Yes, we can do it, but is it worth doing? This is something you could raise for discussion in a Jira ticket.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2024 16:34:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58183#M31038</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2024-01-22T16:34:57Z</dc:date>
    </item>
    <item>
      <title>Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58184#M31039</link>
      <description>&lt;P&gt;I assume it should be worth handling, as the filename or extension should not be a constraint on processing the data.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;We know it is a gzip file, and we could pass a parameter to read it as gzip.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks a lot for your responses &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/75976"&gt;@Lakshay&lt;/a&gt;.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2024 16:44:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58184#M31039</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2024-01-22T16:44:11Z</dc:date>
    </item>
    <item>
      <title>Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58185#M31040</link>
      <description>&lt;P&gt;Happy to help!&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2024 16:48:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/58185#M31040</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2024-01-22T16:48:47Z</dc:date>
    </item>
    <item>
      <title>Re: Spark read GZ file as corrupted data, when file extension having .GZ in upper case</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/82629#M36702</link>
      <description>&lt;P&gt;Recently I started looking at a solution for this issue again, and found that we could add a few exceptions to allow "GZ" in the Hadoop library, since GzipCodec is invoked from there.&lt;/P&gt;</description>
      <pubDate>Sat, 10 Aug 2024 10:47:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-gz-file-as-corrupted-data-when-file-extension-having/m-p/82629#M36702</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2024-08-10T10:47:52Z</dc:date>
    </item>
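    <!--
    The fix suggested above, an exception in the Hadoop-side lookup so that "GZ" also resolves to GzipCodec, amounts to making the extension comparison case-insensitive. Sketched here in Python rather than the actual Java codec factory (the mapping is illustrative):

    ```python
    CODEC_BY_EXTENSION = {".gz": "GzipCodec", ".bz2": "BZip2Codec"}

    def find_codec_ci(filename: str):
        """Case-insensitive variant: lower-case the name before the suffix match."""
        lowered = filename.lower()
        for ext, codec in CODEC_BY_EXTENSION.items():
            if lowered.endswith(ext):
                return codec
        return None

    # Both casings now resolve to the gzip codec.
    assert find_codec_ci("events.csv.gz") == "GzipCodec"
    assert find_codec_ci("events.csv.GZ") == "GzipCodec"
    ```
    -->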
  </channel>
</rss>

