Expand and read Zip compressed files not working

MrDataMan
New Contributor II

I am trying to unzip compressed files following this doc (https://docs.databricks.com/en/files/unzip-files.html), but I am getting an error.

When I run:

dbutils.fs.mv("file:/LoanStats3a.csv", "dbfs:/tmp/LoanStats3a.csv") 

I get the following error: 

java.io.FileNotFoundException: File file:/LoanStats3a.csv does not exist

Where is the unzipped csv file being saved? I also tried "file:/tmp/file:/LoanStats3a.csv" as the location but that did not work either.

MrDataMan
New Contributor II

I didn't solve the issue with this example, but I figured out how to specify the location where the unzipped files are saved by calling a decompression tool directly.

I used gunzip to unzip my own gzip files like this:

 

# SOURCE_DIR and TARGET_DIR must be set before running this loop
for file in "$SOURCE_DIR"/*.gz; do
  echo "Unzipping $file..."
  gunzip -c "$file" > "$TARGET_DIR/$(basename "$file" .gz)"
  rm "$file" # Delete the original .gz file
done
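
For anyone who prefers to stay in a Python cell, the same loop can be sketched with the standard library. This is only a rough equivalent of the shell loop above, not something from the docs; the function name `gunzip_all` and its parameters are made up for illustration.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(source_dir, target_dir, delete_original=True):
    """Decompress every *.gz file in source_dir into target_dir.

    Hypothetical helper mirroring the shell loop: the output name is the
    input name without its trailing ".gz".
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    for gz_path in sorted(source.glob("*.gz")):
        out_path = target / gz_path.stem  # "data.csv.gz" -> "data.csv"
        # Stream the decompressed bytes so large files are not loaded into memory
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        if delete_original:
            gz_path.unlink()  # same effect as `rm "$file"` in the shell loop
        written.append(out_path)
    return written
```

It would be called with the same two directories as the shell version, for example `gunzip_all(source_dir, target_dir)`, and returns the list of files it wrote.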

gabsylvain
Databricks Employee

Hey @MrDataMan,

I wasn't able to reproduce the exact error you got, but I did get a similar one when running the example. To solve it, I tweaked the code a little:

 

%sh curl https://resources.lendingclub.com/LoanStats3a.csv.zip --output /dbfs/tmp/LoanStats3a.csv.zip
unzip /dbfs/tmp/LoanStats3a.csv.zip -d /dbfs/tmp/

 

As you can see, I changed the output location of the curl command and specified the destination of the unzip command so that both point to DBFS (through the /dbfs mount) instead of the driver's local tmp/ directory.
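
If you would rather do the extraction step from Python, the same idea (always passing an explicit destination rather than relying on the current working directory) can be sketched with the standard `zipfile` module. The helper name `unzip_to` is hypothetical, and I am assuming the /dbfs mount is available on your cluster if you point it at DBFS paths.

```python
import zipfile
from pathlib import Path


def unzip_to(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the extracted paths.

    Hypothetical helper: the destination is always explicit, so there is
    no doubt about where the files end up.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]
```

With the paths from the commands above, the call would look like `unzip_to("/dbfs/tmp/LoanStats3a.csv.zip", "/dbfs/tmp/")`.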

Then we can read it using Spark: 

 

df = spark.read.format("csv").option("skipRows", 1).option("header", True).load("dbfs:/tmp/LoanStats3a.csv")
display(df)

 

 

Note: Access to DBFS is required for this example.
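
One thing that is easy to trip over: the shell commands above address DBFS through the /dbfs/... mount path, while Spark takes the dbfs:/... URI for the same file. A small sketch of converting between the two forms is below; both function names are made up for illustration and are not part of dbutils or Spark.

```python
def dbfs_uri_to_local(path):
    """Map 'dbfs:/tmp/x' to '/dbfs/tmp/x'; other paths are returned unchanged."""
    prefix = "dbfs:/"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


def local_to_dbfs_uri(path):
    """Map '/dbfs/tmp/x' to 'dbfs:/tmp/x'; other paths are returned unchanged."""
    prefix = "/dbfs/"
    if path.startswith(prefix):
        return "dbfs:/" + path[len(prefix):]
    return path
```

For example, `local_to_dbfs_uri("/dbfs/tmp/LoanStats3a.csv")` gives the path passed to `load(...)` in the Spark read above.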

Thanks,

Gab