
Unzipping the same file twice does not execute

RantoB
Valued Contributor

Hi, 

I need to unzip some files that are ingested, but when I unzip the same zipped file twice, the unzip command does not execute:

As suggested in the documentation, I did:

import urllib.request
urllib.request.urlretrieve("https://resources.lendingclub.com/LoanStats3a.csv.zip", "/tmp/LoanStats3a.csv.zip")
%sh
unzip /tmp/LoanStats3a.csv.zip

but when I run unzip again, the command never finishes and seems to be stuck in a loop with no output.
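
For reference, a minimal sketch of an extraction that can be re-run without waiting on anything. It assumes (this is not confirmed in the thread) that the second unzip is stuck waiting for an answer to its "replace existing file?" prompt, which a notebook cell cannot supply. Python's standard zipfile module overwrites existing files without asking. The function name extract_zip and the destination directory are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir.

    Existing files with the same name are overwritten without any prompt,
    so calling this twice on the same archive does not block.
    """
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Usage would look like extract_zip("/tmp/LoanStats3a.csv.zip", "/tmp/LoanStats3a"), where the second path is only an example. If the assumption about the prompt is right, the shell equivalent is the -o option of unzip, which overwrites existing files without prompting.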

Thanks for your help.

17 REPLIES

Prabakar
Databricks Employee

Hi @Bertrand BURCKER, as you mentioned that your zip file is large, can you let us know the size of the file?

Also, have you tried with a smaller zip file, and what is the result?

RantoB
Valued Contributor

My file is 180 MiB. For information, the cluster is a single node standard_F4s.

Hubert-Dudek
Esteemed Contributor III

Another problem is that DBFS storage doesn't support random writes (which zip uses):

"Does not support random writes. For workloads that require random writes, perform the operations on local disk first and then copy the result to /dbfs."

source: https://docs.databricks.com/data/databricks-file-system.html#local-file-api-limitations
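
The "local disk first, then copy" advice quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name unzip_then_copy and both paths in the usage note are hypothetical, and it assumes the destination is reachable as an ordinary directory (for example through the /dbfs mount).

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unzip_then_copy(zip_path, final_dir):
    """Extract on local disk first, then copy the extracted files to final_dir."""
    final = Path(final_dir)
    final.mkdir(parents=True, exist_ok=True)
    copied = []
    # The scratch directory lives on local disk and is deleted afterwards.
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(scratch)
        # Copy each extracted file to the final location, keeping subfolders.
        for src in sorted(Path(scratch).rglob("*")):
            if src.is_file():
                target = final / src.relative_to(scratch)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(src, target)
                copied.append(str(target))
    return copied
```

For example, unzip_then_copy("/tmp/LoanStats3a.csv.zip", "/dbfs/tmp/loanstats") would unpack in a temporary local folder and only then write the results under /dbfs; the target folder name here is made up.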
