I have to read a zipped CSV file using Spark without unzipping it. Can anyone please provide PySpark/Spark SQL code for that?

Jitu
New Contributor II

Zipped CSV files are arriving in the S3 raw layer.

6 REPLIES

Anonymous
Not applicable

Why can't you unzip it? You cannot read ZIP files with Spark directly, as ZIP isn't a compression codec Spark supports natively. https://docs.databricks.com/files/unzip-files.html has some instructions on how to unzip them and read them.

Additionally, if you don't want to or can't unzip the whole archive, you can list the archive's contents and extract only the selected file, as sketched below.
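For example, a minimal sketch of that selective approach with Python's standard `zipfile` module (the paths and member name here are hypothetical):

import zipfile

# list the archive contents, then extract only the member you need
with zipfile.ZipFile("/dbfs/mnt/raw/data.zip", mode="r") as zf:
  print(zf.namelist())
  zf.extract("selected_file.csv", path="/dbfs/mnt/raw/extracted")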

Still, as @Joseph Kambourakis​ asked - why can't you just unzip it? What's blocking you?

belbert
New Contributor II

We encountered a similar issue, but with gzip files. If you can convert your files to gzip instead of ZIP, it is as easy as the following (in PySpark):

df = spark.read.option("header", "true").csv(PATH + "/*.csv.gz")

As best as I can tell, this is not possible with ZIP files. But if you have somewhere to write the output, writing a Python or Scala script to unzip and then gzip the files should not be too hard (see the sketch below), if keeping them compressed is required; otherwise do what @Joseph Kambourakis​ said and just unzip them 🙂
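A rough sketch of that conversion in Python, assuming hypothetical paths (it streams each CSV member out of the ZIP and into a gzip file Spark can read):

import gzip
import shutil
import zipfile

with zipfile.ZipFile("/dbfs/mnt/raw/data.zip", mode="r") as zf:
  for name in zf.namelist():
    if name.endswith(".csv"):
      # copy the member's bytes straight into a .csv.gz without loading it all into memory
      with zf.open(name) as src, gzip.open(f"/dbfs/mnt/raw/{name}.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)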

Good point from @Ben Elbert​ that Spark can read compressed files (the `compression` property is mentioned here: https://spark.apache.org/docs/latest/sql-data-sources-csv.html). Still, it won't work with a .zip archive.
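For what it's worth, on the read side Spark infers the codec from the file extension (as in the `.csv.gz` example above); the `compression` option applies when writing. A small sketch with placeholder paths:

# codec is inferred from the .gz extension on read
df = spark.read.option("header", "true").csv("/mnt/raw/*.csv.gz")
# the `compression` option controls the codec of the written part files
df.write.option("header", "true").option("compression", "gzip").csv("/mnt/output/csv_gz")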

Bartek
Contributor

One more solution: you can read a .zip file using the good old pandas `read_csv` method (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv):

import pandas as pd

# pandas infers ZIP compression from the file extension
simple_csv_zipped = pd.read_csv("/dbfs/FileStore/simple_file.zip")

Still, there is one disclaimer: "If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in."

And there is also an obvious trade-off: using pandas means no distribution, no scalability, and exposure to OOM errors. But maybe in your specific case that is acceptable.
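If you go this route but still need a Spark DataFrame afterwards, one possible handoff (a sketch, reusing the path above):

import pandas as pd

# read the single-file ZIP with pandas, then convert for distributed processing
pdf = pd.read_csv("/dbfs/FileStore/simple_file.zip")
df = spark.createDataFrame(pdf)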

Chaitanya_Raju
Honored Contributor

@Jog Giri​  I also recently encountered a similar scenario; the code below solved my purpose without any issues.

import zipfile

# iterate over every ZIP archive in the mounted directory
for i in dbutils.fs.ls('/mnt/zipfilespath/'):
  # rewrite the dbfs: URI to a local /dbfs path so zipfile can open it
  with zipfile.ZipFile(i.path.replace('dbfs:', '/dbfs'), mode="r") as zip_ref:
    # extract all members into destination_path (a /dbfs directory of your choice)
    zip_ref.extractall(destination_path)

Here I had mounted an ADLS Gen2 container that contains several zipped .csv files. Please let me know if you face any further issues, happy to help!!
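Once extracted, reading the files back with Spark is straightforward. A sketch, assuming `destination_path` from the snippet above is a local /dbfs-style directory:

# convert the local /dbfs path back to a dbfs: URI and read all extracted CSVs
df = spark.read.option("header", "true").csv(destination_path.replace('/dbfs', 'dbfs:') + "/*.csv")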
