Databricks Community

joggiri · ‎12-12-2022

Zipped CSV files are arriving in the S3 raw layer.

Anonymous · ‎12-13-2022

Why can't you unzip it? Spark can't read ZIP archives directly because ZIP isn't one of the compression codecs it supports. https://docs.databricks.com/files/unzip-files.html has some instructions on how to unzip them and read them.

Bartek · ‎12-13-2022

Additionally, if you don't want to or can't unzip the whole archive, you can list the contents of the archive and unzip only a selected file.
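A minimal sketch of listing an archive and extracting a single entry, using only Python's standard `zipfile` module. The helper names `list_members` and `extract_member` are hypothetical, and the paths are whatever local-style paths your environment exposes:

```python
import zipfile


def list_members(archive_path):
    """Return the names of all entries in the archive without extracting anything."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.namelist()


def extract_member(archive_path, member, dest_dir):
    """Extract one entry into dest_dir and return the path of the extracted file."""
    with zipfile.ZipFile(archive_path) as zf:
        # extract() writes only the named entry; other entries stay in the archive.
        return zf.extract(member, path=dest_dir)
```

Calling `list_members` first lets you pick the entry name to pass to `extract_member`, so the other files in the archive are never written to disk.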

Still, as @Joseph Kambourakis asked - why can't you just unzip it? What's blocking you?

belbert · ‎12-14-2022

We encountered a similar issue, but for gzip files. If you can convert your files to gzip instead of ZIP, it is as easy as the following (in PySpark):

df = spark.read.option("header", "true").csv(PATH + "/*.csv.gz")

As best as I can tell, this is not possible with ZIP files, but if you have a place where you can write the output to, writing a Python or Scala script to unzip and then gzip the files should not be too hard [if keeping them compressed is required, else do what @Joseph Kambourakis said and just unzip them 🙂 ]
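A rough sketch of such an unzip-then-gzip script, using only the Python standard library. The function name `zip_to_gzip` is hypothetical, and flattening any folders inside the archive is an assumption made to keep the example short:

```python
import gzip
import os
import shutil
import zipfile


def zip_to_gzip(archive_path, out_dir):
    """Re-compress every file in a ZIP archive as its own .gz file.

    Returns the list of .gz paths written. Folder structure inside the
    archive is flattened (an assumption of this sketch).
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, they hold no data
            target = os.path.join(out_dir, os.path.basename(info.filename) + ".gz")
            # Stream from the ZIP entry into gzip so a large file is
            # never held fully in memory.
            with zf.open(info) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

An entry named `data.csv` ends up as `data.csv.gz`, which matches the `*.csv.gz` pattern in the PySpark read shown above.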

Bartek · ‎12-14-2022

Great that you pointed out, @Ben Elbert, that Spark can read compressed files (the `compression` property is mentioned here: https://spark.apache.org/docs/latest/sql-data-sources-csv.html). Still, it won't work with a .zip archive.

Bartek · ‎12-14-2022

One more solution: you can read a .zip using the good old pandas `read_csv` method (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)

import pandas as pd
 
simple_csv_zipped = pd.read_csv("/dbfs/FileStore/simple_file.zip")

Still, there is one disclaimer: "If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in."

There is also an obvious trade-off: using pandas means no distribution, no scalability, and exposure to OOM errors. Maybe in your specific case that is acceptable.
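If the archive holds more than one file, one possible workaround is to open a specific entry with the standard `zipfile` module and parse it in memory. The sketch below uses only the standard library; `read_csv_member` is a hypothetical helper name, and UTF-8 encoding is an assumption:

```python
import csv
import io
import zipfile


def read_csv_member(archive_path, member):
    """Parse one CSV entry of a ZIP archive into a list of dicts, without writing to disk."""
    with zipfile.ZipFile(archive_path) as zf:
        with zf.open(member) as raw:
            # ZipFile.open yields bytes; wrap it so the csv module can read text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Materialize the rows before the archive is closed.
            return list(csv.DictReader(text))
```

The handle returned by `zf.open(member)` is a regular file-like object, so it can also be passed to `pd.read_csv` in place of a path. The same single-node trade-off applies either way.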

Chaitanya_Raju · ‎12-16-2022

@Jog Giri I also recently encountered a similar scenario, and the code below worked for me without any issues.

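import zipfile

# destination_path is not defined in this snippet; set it to the directory you want to extract into
for i in dbutils.fs.ls('/mnt/zipfilespath/'):
  # zipfile needs a local-style path, so swap the dbfs: scheme for the /dbfs mount
  with zipfile.ZipFile(i.path.replace('dbfs:', '/dbfs'), mode="r") as zip_ref:
    zip_ref.extractall(destination_path)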

Here I had mounted an ADLS Gen 2 container that holds several zipped .csv files. Please let me know if you face any further issues, happy to help!

Thanks for reading. Please like this if it is useful, and comment with any improvements or feedback.


I have to read a zipped CSV file using Spark without unzipping it. Can anyone please provide PySpark/Spark SQL code for that?
