12-12-2022 11:41 PM
12-13-2022 03:55 AM
Why can't you unzip it? You cannot read zipped files with Spark directly, since ZIP is not a supported file type. https://docs.databricks.com/files/unzip-files.html has instructions on how to unzip them and read them.
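For reference, a rough sketch of that approach (file names are placeholders, and the shell and Python parts go in separate notebook cells):

%sh
# unzip from the DBFS FUSE mount onto the driver's local disk
unzip /dbfs/FileStore/simple_file.zip -d /tmp/unzipped

# back in a Python cell: copy the extracted CSV to DBFS, then read it with Spark
dbutils.fs.cp("file:/tmp/unzipped/simple_file.csv", "dbfs:/FileStore/simple_file.csv")
df = spark.read.option("header", "true").csv("/FileStore/simple_file.csv")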
12-13-2022 02:07 PM
Additionally, if you don't want to or can't unzip the whole archive, you can list its contents and unzip only a selected file.
Still, as @Joseph Kambourakis asked - why can't you just unzip it? What's blocking you?
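For example, a minimal sketch with Python's zipfile module (the archive path and member name are placeholders):

import zipfile

# open the archive via the /dbfs FUSE path and inspect it without extracting everything
with zipfile.ZipFile("/dbfs/FileStore/archive.zip") as zf:
    print(zf.namelist())  # list every member in the archive
    # pull out just the one file you need
    zf.extract("selected_file.csv", path="/dbfs/FileStore/unzipped/")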
12-14-2022 07:57 AM
We encountered a similar issue, but with gzip files. If you can convert your files to gzip instead of ZIP, it is as easy as the following (in PySpark):
df = spark.read.option("header", "true").csv(PATH + "/*.csv.gz")
As best as I can tell, this is not possible with ZIP files, but if you have a place where you can write the output, a Python or Scala script that unzips and then gzips the files should not be too hard; a rough sketch follows [if keeping them compressed is required, else do what @Joseph Kambourakis said and just unzip them 🙂]
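Something along these lines, assuming the archives sit under a /dbfs path (all paths here are placeholders):

import gzip
import os
import shutil
import zipfile

src = "/dbfs/FileStore/simple_file.zip"
dst_dir = "/dbfs/FileStore/gzipped/"
os.makedirs(dst_dir, exist_ok=True)

with zipfile.ZipFile(src) as zf:
    for name in zf.namelist():
        if name.endswith(".csv"):
            # stream each CSV member straight into its own .gz file
            out = os.path.join(dst_dir, os.path.basename(name) + ".gz")
            with zf.open(name) as fin, gzip.open(out, "wb") as fout:
                shutil.copyfileobj(fin, fout)

After that, the spark.read line above picks the files up directly.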
12-14-2022 08:29 AM
Good point from @Ben Elbert: Spark can read compressed files (see the `compression` property mentioned here: https://spark.apache.org/docs/latest/sql-data-sources-csv.html). Still, it won't work with a .zip archive.
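To illustrate the split (paths are placeholders): on read, Spark infers the codec from the file extension, while the `compression` option applies when writing:

# reading: the gzip codec is inferred from the .csv.gz extension
df = spark.read.option("header", "true").csv("/FileStore/data/*.csv.gz")

# writing: here the `compression` option picks the codec explicitly
df.write.option("compression", "gzip").csv("/FileStore/out/")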
12-14-2022 01:53 PM
One more solution: you can read a .zip with the good old pandas `read_csv` method (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)
import pandas as pd
simple_csv_zipped = pd.read_csv("/dbfs/FileStore/simple_file.zip")
Still, there is one disclaimer in the docs: "If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in."
And there is also an obvious trade-off: using pandas means no distribution, no scalability, and exposure to OOM errors, but maybe in your specific case that is acceptable.
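If the file fits in driver memory, one way to get back to distributed processing right after loading (a sketch, path is a placeholder):

import pandas as pd

# pandas reads the zipped CSV on the driver only...
pdf = pd.read_csv("/dbfs/FileStore/simple_file.zip")
# ...then hand it to Spark so downstream work is distributed again
df = spark.createDataFrame(pdf)
df.show()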
12-16-2022 06:42 PM
@Jog Giri I also recently encountered a similar scenario; the code below solved it for me without any issues.
import zipfile

# loop over every archive in the mounted container
for i in dbutils.fs.ls('/mnt/zipfilespath/'):
    # dbutils returns dbfs:/ URIs; the zipfile module needs the /dbfs FUSE path
    with zipfile.ZipFile(i.path.replace('dbfs:', '/dbfs'), mode="r") as zip_ref:
        zip_ref.extractall(destination_path)
Here I mounted an ADLS Gen2 container holding several zipped .csv files, and `destination_path` is wherever the extracted files should land. Please let me know if you face any further issues, happy to help!
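Once extraction finishes, the files can be read back with Spark as usual (assuming `destination_path` is a `/dbfs/...` FUSE path):

# switch from the FUSE path back to a dbfs: URI for Spark
df = spark.read.option("header", "true").csv(destination_path.replace("/dbfs", "dbfs:") + "/*.csv")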