<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Working with a text file that is both compressed by bz2 followed by zip in PySpark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/62982#M32144</link>
    <description>Discussion thread from the Data Engineering forum: working with a text file that is compressed by bz2 followed by zip in PySpark.</description>
    <pubDate>Fri, 08 Mar 2024 00:24:42 GMT</pubDate>
    <dc:creator>MichTalebzadeh</dc:creator>
    <dc:date>2024-03-08T00:24:42Z</dc:date>
    <item>
      <title>Working with a text file that is both compressed by bz2 followed by zip in PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/62982#M32144</link>
      <description>&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;SPAN&gt;I have downloaded Amazon reviews for sentiment analysis from here. The file is not particularly large (just over 500MB) but comes in the following format:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;test.ft.txt.bz2.zip&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;So it is a text file that is compressed by bz2 and then by zip. I would like to do all these operations in PySpark, but in PySpark a file cannot be decompressed with both .bz2 and .zip at the same time.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The way I do it is to place the downloaded file in a local directory and then perform some operations that are simple but messy. I unzip the file using the zipfile package, which works with an OS-style filename as opposed to the "file:///..." style. This necessitates using two path styles: the OS style for the zip step and the "file:///" style to read the bz2 file directly into a df in&amp;nbsp;&lt;/SPAN&gt;PySpark.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os
import zipfile
from pyspark.sql.functions import expr  # needed for expr() used below
data_path = "file:///d4T/hmduser/sentiments/"
input_file_path = os.path.join(data_path, "test.ft.txt.bz2")
output_file_path = os.path.join(data_path, "review_text_file")
dir_name = "/d4T/hmduser/sentiments/"
zipped_file = os.path.join(dir_name, "test.ft.txt.bz2.zip")
bz2_file = os.path.join(dir_name, "test.ft.txt.bz2")
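# Note: two path styles are used deliberately, as described above: plain OS paths
# (dir_name, zipped_file, bz2_file) for the zipfile step, and the "file:///" URI
# (data_path, input_file_path) for the Spark read further down.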
try:
    # Unzip the file
    with zipfile.ZipFile(zipped_file, 'r') as zip_ref:
        zip_ref.extractall(os.path.dirname(bz2_file))
   
    # Now bz2_file should contain the path to the unzipped file
    print(f"Unzipped file: {bz2_file}")
except Exception as e:
    print(f"Error during unzipping: {str(e)}")

# Load the bz2 file into a DataFrame
df = spark.read.text(input_file_path)
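# A possible one-pass alternative (untested sketch; assumes the archive holds exactly
# one bz2 member and that the decompressed text fits in driver memory): skip the
# intermediate .bz2 file on disk by reading the zip's single member in the driver,
# bz2-decompressing it in memory, and parallelising the lines.
# import bz2
# with zipfile.ZipFile(zipped_file, 'r') as zf:
#     raw = zf.read(zf.namelist()[0])  # bytes of test.ft.txt.bz2
# lines = bz2.decompress(raw).decode("utf-8").splitlines()
# df = spark.createDataFrame([(line,) for line in lines], ["value"])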
# Remove the '__label__1' and '__label__2' prefixes
df = df.withColumn("review_text", expr("regexp_replace(value, '__label__[12] ', '')"))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Then the rest is just spark-ml.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Once I have finished, I remove the bz2 file to&amp;nbsp;&lt;/SPAN&gt;clean up.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;if os.path.exists(bz2_file):  # Check whether the bz2 file exists
    try:
        os.remove(bz2_file)
        print(f"Successfully deleted {bz2_file}")
    except OSError as e:
        print(f"Error deleting {bz2_file}: {e}")
else:
    print(f"bz2 file {bz2_file} could not be found")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;My question is can these operations be done more efficiently&amp;nbsp;in Pyspark itself ideally with one df operation reading the original file (.bz2.zip)?&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Thanks&lt;BR /&gt;&lt;BR /&gt;Mich Talebzadeh,&lt;BR /&gt;Dad | Technologist | Solutions Architect | Engineer&lt;BR /&gt;London&lt;BR /&gt;United Kingdom&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;TABLE cellpadding="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&amp;nbsp;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 08 Mar 2024 00:24:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/62982#M32144</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-08T00:24:42Z</dc:date>
    </item>
    <item>
      <title>Re: Working with a text file that is both compressed by bz2 followed by zip in PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/63046#M32158</link>
      <description>&lt;P&gt;Thanks for your reply&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the face of it, Spark can handle both .bz2 and .zip. In practice it does not work with both at the same time; you end up with illegible characters as text. I suspect it handles decompression of the outer layer (in this case the unzip) but leaves the inner one as is.&amp;nbsp;Sorry I could not post it.&amp;nbsp;&lt;/P&gt;&lt;P&gt;In other words, PySpark can do one unzip or one bz2 -d, but not both at the same time.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Cheers&lt;/P&gt;</description>
      <pubDate>Fri, 08 Mar 2024 12:08:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/63046#M32158</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-08T12:08:25Z</dc:date>
    </item>
  </channel>
</rss>

