Data Engineering

unzip twice the same file not executing

RantoB
Valued Contributor

Hi, 

I need to unzip some ingested files, but when I unzip the same zipped file twice, the unzip command does not execute:

As suggested in the documentation, I did:

import urllib.request
urllib.request.urlretrieve("https://resources.lendingclub.com/LoanStats3a.csv.zip", "/tmp/LoanStats3a.csv.zip")
%sh
unzip /tmp/LoanStats3a.csv.zip

but when I apply unzip again, the command never executes and seems to be blocked in a loop with no output.

Thanks for your help.

1 ACCEPTED SOLUTION

Accepted Solutions

Hubert-Dudek
Esteemed Contributor III

Another problem is that dbfs storage doesn't support random writes (used by zip):

Does not support random writes. For workloads that require random writes, perform the operations on local disk first and then copy the result to /dbfs.

source: https://docs.databricks.com/data/databricks-file-system.html#local-file-api-limitations
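A minimal sketch of that workaround (paths and the helper name are illustrative, not from the thread): perform the extraction, with its random writes, on the driver's local disk, then do a plain sequential copy of the results to the /dbfs mount.

```python
import os
import shutil
import zipfile

def unzip_via_local_disk(zip_path: str, local_tmp: str, dbfs_dest: str) -> None:
    """Extract a zip on local disk first, then copy the results to DBFS.

    The DBFS local-file API does not support the random writes that unzip
    performs, so extraction happens under `local_tmp` (driver-local disk)
    and only the finished files are copied to `dbfs_dest`.
    """
    os.makedirs(local_tmp, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(local_tmp)  # random writes land on local disk
    os.makedirs(dbfs_dest, exist_ok=True)
    for name in os.listdir(local_tmp):
        # sequential copy to the /dbfs mount is supported
        shutil.copy(os.path.join(local_tmp, name), os.path.join(dbfs_dest, name))

# On Databricks this might be called as:
# unzip_via_local_disk("/tmp/LoanStats3a.csv.zip", "/tmp/unzipped", "/dbfs/tmp/unzipped")
```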


19 REPLIES

Kaniz
Community Manager

Hi @RantoB! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise, I will get back to you soon. Thanks.

RantoB
Valued Contributor

Ok.

Note that I have the same behaviour when I'm using the Python API:

import zipfile

with zipfile.ZipFile(path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

Anonymous
Not applicable

If you're going to be reading the files with Spark, you don't need to unzip them. Spark's CSV reader can read zipped or unzipped CSVs.

If you're going to be using URL retrieve, remember that it will put the files on the driver and not in DBFS so you'll have to move it into the distributed filesystem to use Spark to read them.
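That move from driver-local disk into the distributed filesystem can be sketched in plain Python (paths and the helper name are illustrative; on Databricks, `dbutils.fs.cp("file:/tmp/...", "dbfs:/tmp/...")` is an equivalent route):

```python
import shutil

def move_driver_file_to_dbfs(driver_path: str, dbfs_path: str) -> None:
    """Copy a file from the driver's local disk to the DBFS fuse mount.

    urlretrieve writes to the driver's local filesystem only; copying the
    file to a path under /dbfs makes it visible to the distributed
    filesystem. A plain sequential copy like this is supported by DBFS.
    """
    shutil.copy(driver_path, dbfs_path)

# On Databricks this might look like:
# move_driver_file_to_dbfs("/tmp/LoanStats3a.csv.zip", "/dbfs/tmp/LoanStats3a.csv.zip")
```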

RantoB
Valued Contributor

Actually, I will not be reading the file with Spark at this stage, and I am not using URL retrieve either; that was just for the reproducible example.

Zipped files are ingested on ADLS Gen2 and I unzip them into distinct directories depending on their names. But when I execute my script a second time, I am facing the problem I described above.
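That per-name routing could be sketched as follows (the directory layout, glob pattern, and function name are assumptions for illustration, not the original script; extracting to a driver-local root first, then copying to /dbfs, sidesteps the random-write limitation mentioned in the accepted answer):

```python
import zipfile
from pathlib import Path

def extract_by_name(ingest_dir: str, output_root: str) -> list:
    """Extract every zip in `ingest_dir` into a directory named after it.

    e.g. sales_2021.zip -> <output_root>/sales_2021/...
    """
    extracted = []
    for zip_file in sorted(Path(ingest_dir).glob("*.zip")):
        target = Path(output_root) / zip_file.stem  # directory from file name
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_file, "r") as zf:
            zf.extractall(target)
        extracted.append(str(target))
    return extracted
```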

Anonymous
Not applicable

So when you use %sh it's going to use the file system on the driver, which is temporary. The driver storage is the local disc on a VM, not ADL2.

RantoB
Valued Contributor

Yes, I understand, but whatever I do with the unzipped files, I am asking why there is a problem executing the unzip action twice.

Hubert-Dudek
Esteemed Contributor III
  1. Community Edition may block saving to the file system or executing %sh.
  2. In other editions, please verify that the file is there. It may have been saved, for example, to a dbfs folder.

The following commands could help:

dbutils.fs.ls("dbfs:/tmp/")
%sh
ls /dbfs/tmp

You can also consider adjusting your script to use the /dbfs prefix, for example:

    import urllib.request
    urllib.request.urlretrieve("https://resources.lendingclub.com/LoanStats3a.csv.zip", "/dbfs/tmp/LoanStats3a.csv.zip")

RantoB
Valued Contributor

I am trying to run :

import zipfile

with zipfile.ZipFile(path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

but I think I am facing some issues because my zip file is quite large.
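For a large archive, extracting member by member in bounded chunks (instead of a single `extractall`) keeps memory use flat and, targeting local disk first, avoids DBFS's random-write limitation. A sketch under those assumptions (function name and chunk size are illustrative):

```python
import os
import shutil
import zipfile

def extract_large_zip(zip_path: str, dest_dir: str, chunk_size: int = 1 << 20) -> None:
    """Stream each member of a (possibly large) zip to disk in chunks."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            out_path = os.path.join(dest_dir, info.filename)
            os.makedirs(os.path.dirname(out_path), exist_ok=True)
            # copy in fixed-size chunks so memory stays bounded
            with zf.open(info) as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
```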

Hi @Bertrand BURCKER,

Are you still having this issue, or were you able to solve it? Please let us know.

Atanu
Esteemed Contributor

Could you please try without Community Edition? There must be some restriction on %sh.

RantoB
Valued Contributor

No, still not solved.

Hubert-Dudek
Esteemed Contributor III

What is your Databricks version: Azure or free Community Edition?

Have you tried the examples from this article? link

I would like my code to be fully written in Python if possible.
