I have to read a zipped CSV file using Spark without unzipping it. Can anyone please provide PySpark/Spark SQL code for that?
12-12-2022 11:41 PM

12-13-2022 03:55 AM
Why can't you unzip it? You cannot read zipped files with Spark directly, since ZIP is not a compression format Spark supports natively. https://docs.databricks.com/files/unzip-files.html has instructions on how to unzip the files and then read them.
12-13-2022 02:07 PM
Additionally, if you don't want to (or can't) unzip the whole archive, you can list the contents of the archive and unzip only the selected file, as in the sketch below.
Still, as @Joseph Kambourakis asked - why can't you just unzip it? What's blocking you?
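A minimal sketch of that approach using Python's `zipfile` module; the archive path and member name here are just placeholders:
import zipfile

# open the archive via the local /dbfs mount (example path)
with zipfile.ZipFile("/dbfs/FileStore/archive.zip", mode="r") as zip_ref:
    print(zip_ref.namelist())  # list the files inside the archive
    # extract only the one file you need (example member name)
    zip_ref.extract("data.csv", "/dbfs/FileStore/unzipped/")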
12-14-2022 07:57 AM
We encountered a similar issue, but for gzip files. If you can convert your files to gzip instead of ZIP, it is as easy as the following (in PySpark):
df = spark.read.option("header", "true").csv(PATH + "/*.csv.gz")
As best as I can tell, this is not possible with ZIP files, but if you have a place where you can write the output to, writing a Python or Scala script to unzip and then gzip the files should not be too hard (see the sketch below) [if keeping them compressed is required; otherwise do what @Joseph Kambourakis said and just unzip them].
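A minimal sketch of that unzip-then-gzip conversion, assuming the archives sit under a local /dbfs path and each file inside an archive should become its own .csv.gz; all paths here are placeholders:
import gzip
import os
import shutil
import zipfile

src_dir = "/dbfs/FileStore/zipped"    # placeholder: folder containing the .zip archives
dst_dir = "/dbfs/FileStore/gzipped"   # placeholder: folder for the .csv.gz output
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if not name.endswith(".zip"):
        continue
    with zipfile.ZipFile(os.path.join(src_dir, name)) as zf:
        for member in zf.namelist():
            # re-compress each file inside the archive as gzip so Spark can read it directly
            with zf.open(member) as src, gzip.open(os.path.join(dst_dir, member + ".gz"), "wb") as dst:
                shutil.copyfileobj(src, dst)
After that, the resulting .csv.gz files can be read with the spark.read line above.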
12-14-2022 08:29 AM
Great that you pointed out, @Ben Elbert, that Spark can read compressed files (the `compression` property is mentioned here: https://spark.apache.org/docs/latest/sql-data-sources-csv.html). Still, it won't work with a .zip archive.
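For reference, a minimal sketch of that `compression` option on the write side (the output path is just an example); on read, Spark infers codecs such as gzip from the file extension:
# write a DataFrame back out as gzip-compressed CSV
df.write.option("header", "true").option("compression", "gzip").csv("/mnt/output/csv_gzipped/")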
12-14-2022 01:53 PM
One more solution - you can read a .zip file using the good old pandas `read_csv` method (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv):
import pandas as pd

# pandas detects the zip compression from the file extension
simple_csv_zipped = pd.read_csv("/dbfs/FileStore/simple_file.zip")
Still, there is one disclaimer: "If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in."
And there is also an obvious trade-off: using pandas means no distribution, no scalability, and exposure to OOM errors - but maybe in your specific case that is acceptable.
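If you still need a Spark DataFrame afterwards, a minimal follow-up sketch (reusing the variable from the snippet above; the single-node memory caveat still applies):
# convert the pandas DataFrame into a distributed Spark DataFrame
df = spark.createDataFrame(simple_csv_zipped)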
12-16-2022 06:42 PM
@Jog Giri I also recently encountered a similar scenario; the code below solved my purpose without any issues.
import zipfile

# loop over every archive in the mounted container path
for i in dbutils.fs.ls('/mnt/zipfilespath/'):
    # dbutils returns dbfs:/ URIs; zipfile needs the local /dbfs mount path
    with zipfile.ZipFile(i.path.replace('dbfs:', '/dbfs'), mode="r") as zip_ref:
        zip_ref.extractall(destination_path)  # destination_path: folder for the extracted CSVs
Here I mounted an ADLS Gen 2 container that contains several zipped .csv files. Please let me know if you face any further issues - happy to help!
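Once the archives are extracted, the CSVs can be read as usual. A minimal sketch, assuming the files were extracted under /dbfs/FileStore/unzipped/ (just an example path):
# Spark reads through the dbfs:/ URI rather than the /dbfs local mount
df = spark.read.option("header", "true").csv("dbfs:/FileStore/unzipped/*.csv")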

