
I have to read a zipped CSV file using Spark without unzipping it. Can anyone please provide PySpark/Spark SQL code for that?

Jitu
New Contributor II

Zipped CSV files are arriving in the S3 raw layer.

6 REPLIES

Anonymous
Not applicable

Why can't you unzip it? You cannot read zipped files with Spark, as ZIP isn't a compression format Spark supports. https://docs.databricks.com/files/unzip-files.html has some instructions on how to unzip them and read them.

Additionally, if you don't want to or can't unzip the whole archive, you can list the contents of the archive and unzip only a selected file, as in the sketch below.
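
A minimal sketch of that selective approach, assuming a hypothetical archive at /dbfs/raw/archive.zip containing a member named data.csv:

import zipfile

# Open the archive via the /dbfs fuse mount, list its members,
# and extract only the file you need
with zipfile.ZipFile("/dbfs/raw/archive.zip", mode="r") as zf:
    print(zf.namelist())  # inspect what's inside
    zf.extract("data.csv", path="/dbfs/raw/unzipped/")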

Still, as @Joseph Kambourakis asked: why can't you just unzip it? What's blocking you?

belbert
New Contributor II

We encountered a similar issue, but for gzip files. If you can convert your files to gzip instead of ZIP, it is as easy as the following (in PySpark):

# Spark decompresses .gz files transparently based on the file extension
df = spark.read.option("header", "true").csv(PATH + "/*.csv.gz")

As best as I can tell, this is not possible with ZIP files, but if you have somewhere you can write the output to, writing a Python or Scala script to unzip and then gzip the files should not be too hard (if keeping them compressed is required; otherwise do what @Joseph Kambourakis said and just unzip them 🙂). A sketch of that conversion is below.
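
A minimal sketch of that unzip-then-gzip conversion, assuming hypothetical source and target folders on the /dbfs fuse mount (the target folder must already exist):

import gzip
import shutil
import zipfile

# Re-compress every ZIP member as .csv.gz, which Spark can read
# natively through its built-in gzip codec
with zipfile.ZipFile("/dbfs/raw/archive.zip", mode="r") as zf:
    for name in zf.namelist():
        with zf.open(name) as src, gzip.open(f"/dbfs/raw/gzipped/{name}.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)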

Good point, @Ben Elbert, that Spark allows reading compressed files (the `compression` property mentioned here: https://spark.apache.org/docs/latest/sql-data-sources-csv.html). Still, it won't work with a .zip archive.

Bartek
Contributor

One more solution: you can read a .zip file using the good old pandas `read_csv` method (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv):

import pandas as pd

# pandas infers the ZIP compression from the file extension
simple_csv_zipped = pd.read_csv("/dbfs/FileStore/simple_file.zip")

Still, there is one disclaimer: "If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in."

And there is also an obvious trade-off: using pandas means no distribution, no scalability, and exposure to OOM errors on the driver, but maybe in your specific case that is acceptable. If you need the data distributed afterwards, you can hand it off to Spark, as sketched below.
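
A minimal sketch of that hand-off, assuming the zipped file fits in driver memory (simple_csv_zipped is the pandas DataFrame from the snippet above):

# Convert the pandas DataFrame into a distributed Spark DataFrame
sdf = spark.createDataFrame(simple_csv_zipped)
sdf.show()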

Chaitanya_Raju
Honored Contributor

@Jog Giri I also recently encountered a similar scenario; the code below solved my purpose without any issues.

import zipfile

destination_path = "/dbfs/mnt/unzipped/"  # placeholder: any writable folder

# zipfile needs the local /dbfs fuse path, not the dbfs: URI
for i in dbutils.fs.ls('/mnt/zipfilespath/'):
    with zipfile.ZipFile(i.path.replace('dbfs:', '/dbfs'), mode="r") as zip_ref:
        zip_ref.extractall(destination_path)

where I mounted an ADLS Gen2 container that contains several zipped .csv files. Once extracted, the plain .csv files can be read with Spark as usual, as in the sketch below. Please let me know if you face any further issues, happy to help!!
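
A minimal sketch of reading the extracted files back with Spark, assuming destination_path from the snippet above (the header option is an assumption about the files):

# Spark wants the dbfs: URI rather than the /dbfs fuse path
df = spark.read.option("header", "true").csv(destination_path.replace("/dbfs", "dbfs:"))
df.show()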

Thanks for reading; like if this was useful, and please comment with any improvements or feedback.
