I have to read a zipped CSV file using Spark without unzipping it. Can anyone please provide PySpark/Spark SQL code for that?

Jitu
New Contributor II

Zipped CSV files are arriving in the S3 raw layer.

6 REPLIES

Anonymous
Not applicable

Why can't you unzip it? You cannot read ZIP files with Spark directly, as ZIP isn't a compression codec Spark supports natively. https://docs.databricks.com/files/unzip-files.html has some instructions on how to unzip them and read them.

Additionally, if you don't want to or can't unzip the whole archive, you can list the archive's contents and extract only the selected file, as sketched below.
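For example, a minimal sketch of that selective approach with Python's standard `zipfile` module (the paths and member name here are hypothetical):

import zipfile

# list the archive contents, then extract only the member you need
with zipfile.ZipFile("/dbfs/mnt/raw/data.zip", mode="r") as zf:
  print(zf.namelist())
  zf.extract("selected_file.csv", path="/dbfs/mnt/raw/extracted")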

Still, as @Joseph Kambourakis​ asked - why can't you just unzip it? What's blocking you?

belbert
New Contributor II

We encountered a similar issue, but with gzip files. If you can convert your files to gzip instead of ZIP, it is as easy as the following (in PySpark):

df = spark.read.option("header", "true").csv(PATH + "/*.csv.gz")

As best as I can tell, this is not possible with ZIP files. But if you have somewhere to write the output, writing a Python or Scala script to unzip and then gzip the files should not be too hard (see the sketch below), if keeping them compressed is required; otherwise do what @Joseph Kambourakis​ said and just unzip them 🙂
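A rough sketch of that conversion in Python, assuming hypothetical paths (it streams each CSV member out of the ZIP and into a gzip file Spark can read):

import gzip
import shutil
import zipfile

with zipfile.ZipFile("/dbfs/mnt/raw/data.zip", mode="r") as zf:
  for name in zf.namelist():
    if name.endswith(".csv"):
      # copy the member's bytes straight into a .csv.gz without loading it all into memory
      with zf.open(name) as src, gzip.open(f"/dbfs/mnt/raw/{name}.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)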

Good point from @Ben Elbert​ that Spark can read compressed files (the `compression` property is mentioned here: https://spark.apache.org/docs/latest/sql-data-sources-csv.html). Still, it won't work with a .zip archive.
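For what it's worth, on the read side Spark infers the codec from the file extension (as in the `.csv.gz` example above); the `compression` option applies when writing. A small sketch with placeholder paths:

# codec is inferred from the .gz extension on read
df = spark.read.option("header", "true").csv("/mnt/raw/*.csv.gz")
# the `compression` option controls the codec of the written part files
df.write.option("header", "true").option("compression", "gzip").csv("/mnt/output/csv_gz")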

Bartek
Contributor

One more solution: you can read a .zip file using the good old pandas `read_csv` method (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv):

import pandas as pd

# pandas infers ZIP compression from the file extension
simple_csv_zipped = pd.read_csv("/dbfs/FileStore/simple_file.zip")

Still, there is one disclaimer: "If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in."

And there is also an obvious trade-off: using pandas means no distribution, no scalability, and exposure to OOM errors. But maybe in your specific case that is acceptable.
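If you go this route but still need a Spark DataFrame afterwards, one possible handoff (a sketch, reusing the path above):

import pandas as pd

# read the single-file ZIP with pandas, then convert for distributed processing
pdf = pd.read_csv("/dbfs/FileStore/simple_file.zip")
df = spark.createDataFrame(pdf)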

Chaitanya_Raju
Honored Contributor

@Jog Giri​  I also recently encountered a similar scenario; the code below solved my purpose without any issues.

import zipfile

# iterate over every ZIP archive in the mounted directory
for i in dbutils.fs.ls('/mnt/zipfilespath/'):
  # rewrite the dbfs: URI to a local /dbfs path so zipfile can open it
  with zipfile.ZipFile(i.path.replace('dbfs:', '/dbfs'), mode="r") as zip_ref:
    # extract all members into destination_path (a /dbfs directory of your choice)
    zip_ref.extractall(destination_path)

Here I had mounted an ADLS Gen2 container that contains several zipped .csv files. Please let me know if you face any further issues, happy to help!!
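Once extracted, reading the files back with Spark is straightforward. A sketch, assuming `destination_path` from the snippet above is a local /dbfs-style directory:

# convert the local /dbfs path back to a dbfs: URI and read all extracted CSVs
df = spark.read.option("header", "true").csv(destination_path.replace('/dbfs', 'dbfs:') + "/*.csv")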
