<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Working with a text file that is both compressed by bz2 followed by zip in PySpark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/62982#M32144</link>
    <description>Discussion thread from the Data Engineering forum: working with a text file that is compressed by bz2 followed by zip in PySpark.</description>
    <pubDate>Fri, 08 Mar 2024 00:24:42 GMT</pubDate>
    <dc:creator>MichTalebzadeh</dc:creator>
    <dc:date>2024-03-08T00:24:42Z</dc:date>
    <item>
      <title>Working with a text file that is both compressed by bz2 followed by zip in PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/62982#M32144</link>
      <description>&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;SPAN&gt;I have downloaded Amazon reviews for sentiment analysis from here. The file is not particularly large (just over 500MB) but comes in the following format:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;test.ft.txt.bz2.zip&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;So it is a text file that is compressed by bz2 and then by zip. I would like to do all these operations in PySpark, but in PySpark a file cannot be decompressed with both .bz2 and .zip at the same time.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The way I do it is to place the downloaded file in a local directory and then perform some operations that are simple but messy. I unzip the file using the zipfile package, which works with an OS-style filename as opposed to the "file:///..." style. This necessitates using two path styles: the OS style for the zip step and the "file:///" style to read the bz2 file directly into a df in&amp;nbsp;&lt;/SPAN&gt;PySpark.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os
import zipfile
from pyspark.sql.functions import expr  # needed for expr() used below
data_path = "file:///d4T/hmduser/sentiments/"
input_file_path = os.path.join(data_path, "test.ft.txt.bz2")
output_file_path = os.path.join(data_path, "review_text_file")
dir_name = "/d4T/hmduser/sentiments/"
zipped_file = os.path.join(dir_name, "test.ft.txt.bz2.zip")
bz2_file = os.path.join(dir_name, "test.ft.txt.bz2")
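# Note: two path styles are used deliberately, as described above: plain OS paths
# (dir_name, zipped_file, bz2_file) for the zipfile step, and the "file:///" URI
# (data_path, input_file_path) for the Spark read further down.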
try:
    # Unzip the file
    with zipfile.ZipFile(zipped_file, 'r') as zip_ref:
        zip_ref.extractall(os.path.dirname(bz2_file))
   
    # Now bz2_file should contain the path to the unzipped file
    print(f"Unzipped file: {bz2_file}")
except Exception as e:
    print(f"Error during unzipping: {str(e)}")

# Load the bz2 file into a DataFrame
df = spark.read.text(input_file_path)
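# A possible one-pass alternative (untested sketch; assumes the archive holds exactly
# one bz2 member and that the decompressed text fits in driver memory): skip the
# intermediate .bz2 file on disk by reading the zip's single member in the driver,
# bz2-decompressing it in memory, and parallelising the lines.
# import bz2
# with zipfile.ZipFile(zipped_file, 'r') as zf:
#     raw = zf.read(zf.namelist()[0])  # bytes of test.ft.txt.bz2
# lines = bz2.decompress(raw).decode("utf-8").splitlines()
# df = spark.createDataFrame([(line,) for line in lines], ["value"])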
# Remove the '__label__1' and '__label__2' prefixes
df = df.withColumn("review_text", expr("regexp_replace(value, '__label__[12] ', '')"))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Then the rest is just spark-ml.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Once I have finished, I remove the bz2 file to&amp;nbsp;&lt;/SPAN&gt;clean up.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;if os.path.exists(bz2_file):  # Check whether the bz2 file exists
    try:
        os.remove(bz2_file)
        print(f"Successfully deleted {bz2_file}")
    except OSError as e:
        print(f"Error deleting {bz2_file}: {e}")
else:
    print(f"bz2 file {bz2_file} could not be found")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;My question is can these operations be done more efficiently&amp;nbsp;in Pyspark itself ideally with one df operation reading the original file (.bz2.zip)?&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Thanks&lt;BR /&gt;&lt;BR /&gt;Mich Talebzadeh,&lt;BR /&gt;Dad | Technologist | Solutions Architect | Engineer&lt;BR /&gt;London&lt;BR /&gt;United Kingdom&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;TABLE cellpadding="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&amp;nbsp;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 08 Mar 2024 00:24:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/62982#M32144</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-08T00:24:42Z</dc:date>
    </item>
    <item>
      <title>Re: Working with a text file that is both compressed by bz2 followed by zip in PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/63046#M32158</link>
      <description>&lt;P&gt;Thanks for your reply&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the face of it, Spark can handle both .bz2 and .zip. In practice it does not work with both at the same time; you end up with illegible characters as text. I suspect it handles decompression of the outer layer (in this case the unzip) but leaves the inner one as is.&amp;nbsp;Sorry I could not post it.&amp;nbsp;&lt;/P&gt;&lt;P&gt;In other words, PySpark can do one unzip or one bz2 -d, but not both at the same time.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Cheers&lt;/P&gt;</description>
      <pubDate>Fri, 08 Mar 2024 12:08:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-a-text-file-that-is-both-compressed-by-bz2-followed/m-p/63046#M32158</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-08T12:08:25Z</dc:date>
    </item>
  </channel>
</rss>

