Data Engineering

Unzipping the same file twice: the command does not execute

RantoB
Valued Contributor

Hi, 

I need to unzip some ingested files, but when I unzip the same zipped file a second time, the unzip command does not execute.

As suggested in the documentation, I did:

import urllib.request

urllib.request.urlretrieve("https://resources.lendingclub.com/LoanStats3a.csv.zip", "/tmp/LoanStats3a.csv.zip")
%sh
unzip /tmp/LoanStats3a.csv.zip

but when I run unzip again, the command never finishes and seems to be stuck in an endless loop with no output.

Thanks for your help.
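
A likely explanation, though nothing in this thread confirms it, is that on the second run unzip finds the extracted file already present and asks whether to replace it. A %sh cell has no interactive input, so the command would wait forever; `unzip -o` overwrites without asking. As a minimal sketch that stays in Python, the standard-library zipfile module overwrites existing files silently, so the same extraction can be repeated. The helper name `extract_zip` and the sample file names are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir.

    zipfile overwrites files that already exist and never prompts,
    so calling this twice on the same archive does not block.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Small self-contained demo: build a sample archive, then extract it twice.
with tempfile.TemporaryDirectory() as tmp:
    sample_zip = Path(tmp) / "sample.csv.zip"  # hypothetical file name
    with zipfile.ZipFile(sample_zip, "w") as archive:
        archive.writestr("sample.csv", "id,amount\n1,100\n")

    first_run = extract_zip(sample_zip, Path(tmp) / "out")
    second_run = extract_zip(sample_zip, Path(tmp) / "out")  # no prompt, no hang
    print(first_run, second_run)
```

In the notebook from the question, the call would be `extract_zip("/tmp/LoanStats3a.csv.zip", "/tmp")`, assuming the download step has already run.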

19 REPLIES

Atanu
Esteemed Contributor
  • Could you please try this outside Community Edition? There may be some restriction on %sh there.

Prabakar
Esteemed Contributor III

Hi @Bertrand BURCKER, you mentioned that your zip file is large. Can you let us know the size of the file?

Also, have you tried with a smaller zip file, and if so, what was the result?

RantoB
Valued Contributor

My file is 180 MiB. For information, the cluster is a single-node Standard_F4s.

Hubert-Dudek
Esteemed Contributor III

Another problem is that DBFS storage doesn't support random writes (which zip uses):

"Does not support random writes. For workloads that require random writes, perform the operations on local disk first and then copy the result to /dbfs."

source: https://docs.databricks.com/data/databricks-file-system.html#local-file-api-limitations
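
The documented workaround, doing the work on local disk and then copying the result, could look roughly like the sketch below. It assumes the destination is a directory such as a folder under /dbfs; the function name `unzip_locally_then_copy` and any paths passed to it are illustrative, not taken from the Databricks documentation.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unzip_locally_then_copy(zip_path, final_dir):
    """Extract zip_path on local disk, then copy the files to final_dir.

    Extraction happens in a local scratch directory; only plain file
    copies are made into final_dir afterwards.
    """
    final = Path(final_dir)
    final.mkdir(parents=True, exist_ok=True)
    copied = []
    with tempfile.TemporaryDirectory() as scratch:
        # Step 1: extract on local disk.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(scratch)
        # Step 2: copy each extracted file to the final location.
        for item in sorted(Path(scratch).rglob("*")):
            if item.is_file():
                target = final / item.relative_to(scratch)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(item, target)
                copied.append(str(target.relative_to(final)))
    return copied
```

A call might look like `unzip_locally_then_copy("/tmp/LoanStats3a.csv.zip", "/dbfs/tmp/loans")`, where the destination folder is a hypothetical example.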

Kaniz
Community Manager

Hi @Bertrand BURCKER,

Create a script.sh file and copy the script below into the directory that contains the data.zip archive. The script works with any archive name and any CSV file name.

#!/bin/bash

# Work in the directory the script is launched from (where the archives are).
path="$PWD"

# Extract every top-level archive into a temp directory.
for filename in "$path"/*; do
    extension="${filename##*.}"
    if [ "${extension}" == 'zip' ]; then
        unzip "$filename" -d "$path/temp"
    fi
done

count=0

# Extract each nested archive and move its CSV files next to the original archive.
for filename in "$path"/temp/*; do
    extension="${filename##*.}"   # extension, to check whether this is an archive
    name="${filename##*/}"        # zip file name with extension
    name="${name%.*}"             # zip file name without extension
    if [ "${extension}" == 'zip' ]; then
        count=$((count + 1))
        unzip "$filename" -d "$path/temp/$count"
        for file in "$path/temp/$count"/*; do
            ext="${file##*.}"
            if [ "${ext}" == 'csv' ]; then
                csvFileName="${file##*/}"
                mv "$path/temp/$count/$csvFileName" "$path/$name-$csvFileName"
            fi
        done
    fi
done

# Remove the temp directory.
rm -r "$path/temp"
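
For anyone who would rather stay in a Python cell, the same two-level extraction can be sketched with the standard library. This is an approximation of the script above, not a drop-in replacement: it uses a system temp directory instead of a temp folder inside the data directory, and the function name `extract_nested_csvs` is made up for illustration.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_nested_csvs(folder):
    """Unpack the zips in folder, then the zips they contain, and move
    every CSV found in a nested zip into folder as <inner zip name>-<csv name>."""
    folder = Path(folder)
    moved = []
    with tempfile.TemporaryDirectory() as scratch:
        scratch = Path(scratch)
        # Step 1: extract every top-level archive into the scratch directory.
        for outer in sorted(folder.glob("*.zip")):
            with zipfile.ZipFile(outer) as archive:
                archive.extractall(scratch)
        # Step 2: extract each nested archive into its own numbered folder.
        for count, inner in enumerate(sorted(scratch.glob("*.zip")), start=1):
            inner_dir = scratch / str(count)
            with zipfile.ZipFile(inner) as archive:
                archive.extractall(inner_dir)
            # Step 3: move the CSV files next to the original archives.
            for csv_file in sorted(inner_dir.glob("*.csv")):
                target = folder / f"{inner.stem}-{csv_file.name}"
                shutil.move(str(csv_file), str(target))
                moved.append(target.name)
    return moved
```

The scratch directory is removed automatically when the `with` block exits, which replaces the final `rm -r` step of the shell script.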
