Expand and read Zip compressed files not working

MrDataMan — Thu, 07 Dec 2023 06:25:25 GMT

I am trying to unzip compressed files by following this doc (https://docs.databricks.com/en/files/unzip-files.html), but I am getting an error.

When I run:

dbutils.fs.mv("file:/LoanStats3a.csv", "dbfs:/tmp/LoanStats3a.csv")

I get the following error:

java.io.FileNotFoundException: File file:/LoanStats3a.csv does not exist

Where is the unzipped CSV file being saved? I also tried "file:/tmp/file:/LoanStats3a.csv" as the location, but that did not work either.

Re: Expand and read Zip compressed files not working

MrDataMan — Mon, 11 Dec 2023 01:52:47 GMT

I didn't solve the issue with this example, but I figured out how to specify where the unzipped files are saved when using an unzipping utility.

I used gunzip to unzip my own gzip files like this:

for file in "$SOURCE_DIR"/*.gz; do
    echo "Unzipping $file..."
    gunzip -c "$file" > "$TARGET_DIR/$(basename "$file" .gz)"
    rm "$file" # Delete the original .gz file
done
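
For notebooks where a shell cell is not convenient, the loop above can be roughly sketched in Python with the standard-library gzip and shutil modules. This is only an illustration: the function name gunzip_all and its parameters are hypothetical, and the directories in the usage comment are placeholders, not paths from this thread.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(source_dir, target_dir, delete_original=False):
    """Decompress every *.gz file in source_dir into target_dir.

    Hypothetical helper mirroring the shell loop: each output file keeps
    the original name minus the trailing .gz suffix.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    for gz_path in sorted(source.glob("*.gz")):
        out_path = target / gz_path.stem  # "a.csv.gz" -> "a.csv"
        print(f"Unzipping {gz_path}...")
        # Stream the decompressed bytes so large files are not held in memory.
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        if delete_original:
            gz_path.unlink()  # same effect as `rm "$file"` in the shell loop
        written.append(out_path)
    return written


# Example call (placeholder directories, adjust to your environment):
# gunzip_all("/path/to/source", "/path/to/target", delete_original=True)
```

Unlike the shell version, deleting the original is opt-in here, so a failed run does not silently lose the compressed inputs.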

Re: Expand and read Zip compressed files not working

gabsylvain — Mon, 11 Dec 2023 15:40:24 GMT

Hey @MrDataMan,

I wasn't able to reproduce the exact error you got, but I did get a similar one when running the example. To solve it, I tweaked the code a little:

%sh
curl https://resources.lendingclub.com/LoanStats3a.csv.zip --output /dbfs/tmp/LoanStats3a.csv.zip
unzip /dbfs/tmp/LoanStats3a.csv.zip -d /dbfs/tmp/

As you can see, I changed the output location of the curl command and specified a destination for the unzip command, so that both point to DBFS instead of the root tmp/ directory.

Then we can read it using Spark:

df = spark.read.format("csv").option("skipRows", 1).option("header", True).load("dbfs:/tmp/LoanStats3a.csv")
display(df)

Note: Access to DBFS is required for this example.
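
If a shell cell is not an option, the unzip step with an explicit destination can also be sketched in Python using the standard-library zipfile module. This is a minimal sketch, not the documented approach: the helper name unzip_to is hypothetical, and the usage comment assumes the /dbfs/tmp/ paths from the commands above are reachable from the driver.

```python
import zipfile
from pathlib import Path


def unzip_to(zip_path, target_dir):
    """Extract a .zip archive into target_dir and return the extracted paths.

    Hypothetical helper: passing the destination explicitly avoids having
    to guess which working directory the files end up in.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        # namelist() holds the archive members, relative to the archive root.
        return [target / name for name in archive.namelist()]


# Example call (assumes the archive was already downloaded to this path):
# for path in unzip_to("/dbfs/tmp/LoanStats3a.csv.zip", "/dbfs/tmp"):
#     print(path)
```

Printing the returned paths shows exactly where each file was written, which is the piece that was unclear in the original error.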

Thanks,

Gab