topic I have to read zipped csv file using spark without unzipping it. can anyone please provide pyspark/spark sql code for that? in Data Governance

I have to read zipped csv file using spark without unzipping it. can anyone please provide pyspark/spark sql code for that?

joggiri — Tue, 13 Dec 2022 07:41:23 GMT

Zipped CSV files are arriving in the S3 raw layer.

Re: I have to read zipped csv file using spark without unzipping it. can anyone please provide pyspark/spark sql code for that?

Anonymous — Tue, 13 Dec 2022 11:55:09 GMT

Why can't you unzip it? You cannot read ZIP files directly with Spark, because ZIP is an archive format rather than one of the compression codecs Spark supports. https://docs.databricks.com/files/unzip-files.html has some instructions on how to unzip them and read them.

Re: I have to read zipped csv file using spark without unzipping it. can anyone please provide pyspark/spark sql code for that?

Bartek — Tue, 13 Dec 2022 22:07:40 GMT

Additionally, if you don't want to or can't unzip the whole archive, you can list the contents of the archive and unzip only a selected file.
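A minimal sketch of that idea, using Python's standard `zipfile` module. The function names are made up for illustration, and the paths in the usage comment are hypothetical placeholders (on Databricks they would need to be local `/dbfs/...` paths).

```python
import zipfile


def list_members(archive_path):
    """Return the names of all entries stored in the ZIP archive."""
    with zipfile.ZipFile(archive_path, mode="r") as archive:
        return archive.namelist()


def extract_member(archive_path, member_name, destination_dir):
    """Extract a single entry and return the path of the extracted file."""
    with zipfile.ZipFile(archive_path, mode="r") as archive:
        return archive.extract(member_name, path=destination_dir)


# Hypothetical usage (paths are placeholders, not from this thread):
# names = list_members("/dbfs/tmp/archive.zip")
# extract_member("/dbfs/tmp/archive.zip", names[0], "/dbfs/tmp/unzipped")
```

Only the selected entry is written to disk; the rest of the archive stays compressed.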

Still, as @Joseph Kambourakis asked: why can't you just unzip it? What's blocking you?

Re: I have to read zipped csv file using spark without unzipping it. can anyone please provide pyspark/spark sql code for that?

belbert — Wed, 14 Dec 2022 15:57:19 GMT

We encountered a similar issue, but with gzip files. If you can convert your files to gzip instead of ZIP, it is as easy as the following (in PySpark):

df = spark.read.option("header", "true").csv(PATH + "/*.csv.gz")

As best as I can tell, this is not possible with ZIP files, but if you have somewhere to write the output, a Python or Scala script that unzips and then gzips the files should not be too hard (if keeping them compressed is required; otherwise do what @Joseph Kambourakis said and just unzip them 🙂).
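One way such a conversion script might look in Python, using only the standard library. This is a sketch, not a tested production job: the function name and both path arguments are hypothetical, it only handles entries ending in `.csv`, and it flattens any folders inside the archive, so entries with the same file name would overwrite each other.

```python
import gzip
import os
import shutil
import zipfile


def zip_to_gzip(archive_path, output_dir):
    """Re-compress every .csv entry of a ZIP archive as its own .csv.gz file."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(archive_path, mode="r") as archive:
        for info in archive.infolist():
            # Skip folders and anything that is not a CSV file.
            if info.is_dir() or not info.filename.lower().endswith(".csv"):
                continue
            # Drop any folder prefix and append the .gz suffix.
            target = os.path.join(output_dir, os.path.basename(info.filename) + ".gz")
            # Stream the entry so the whole file never has to fit in memory.
            with archive.open(info) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

The output directory can then be read with a `*.csv.gz` glob as in the PySpark line above. Keep in mind that gzip is not a splittable format, so Spark reads each `.csv.gz` file in a single task.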

Re: I have to read zipped csv file using spark without unzipping it. can anyone please provide pyspark/spark sql code for that?

Bartek — Wed, 14 Dec 2022 16:29:41 GMT

Good point, @Ben Elbert: Spark can read compressed files directly, inferring the codec from the file extension (the `compression` option described at https://spark.apache.org/docs/latest/sql-data-sources-csv.html applies when writing). Still, it won't work with a .zip archive.

Re: I have to read zipped csv file using spark without unzipping it. can anyone please provide pyspark/spark sql code for that?

Bartek — Wed, 14 Dec 2022 21:53:16 GMT

One more solution: you can read a .zip file with the good old pandas `read_csv` function (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)

import pandas as pd
 
simple_csv_zipped = pd.read_csv("/dbfs/FileStore/simple_file.zip")

Still, there is one caveat from the pandas docs: "If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in."

There is also an obvious trade-off: using pandas means no distribution, no scalability, and exposure to OOM errors. But maybe in your specific case that is acceptable.
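If the archive holds more than one file, or the file is too large to load at once, one possible workaround is to skip pandas and stream rows from a chosen entry with the standard library. The sketch below swaps in `zipfile` plus `csv.DictReader` for `read_csv`; the function name and its arguments are hypothetical, and it is still single-machine Python, not Spark.

```python
import csv
import io
import zipfile


def iter_csv_rows(archive_path, member_name, encoding="utf-8"):
    """Yield each row of one CSV entry in a ZIP archive as a dict."""
    with zipfile.ZipFile(archive_path, mode="r") as archive:
        with archive.open(member_name) as raw:
            # Wrap the binary stream so the csv module can decode it lazily.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            # Rows are produced one at a time, keyed by the header line.
            yield from csv.DictReader(text)
```

Because this is a generator, rows are decompressed and parsed only as they are consumed, so memory use stays roughly constant regardless of file size.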

Re: I have to read zipped csv file using spark without unzipping it. can anyone please provide pyspark/spark sql code for that?

Chaitanya_Raju — Sat, 17 Dec 2022 02:42:57 GMT

@Jog Giri I recently encountered a similar scenario, and the code below worked for me without any issues.

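import zipfile

# destination_path must already be set to a local (/dbfs/...) output directory
for i in dbutils.fs.ls('/mnt/zipfilespath/'):
  # zipfile needs a local path, so map the dbfs: URI onto the /dbfs mount
  with zipfile.ZipFile(i.path.replace('dbfs:','/dbfs'), mode="r") as zip_ref:
    zip_ref.extractall(destination_path)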

Here I had mounted an ADLS Gen2 container that contains several zipped .csv files. Please let me know if you face any further issues, happy to help!