11-02-2021 08:17 AM
Hi,
I need to unzip some ingested files, but when I unzip the same zipped file twice, the unzip command does not execute:
As suggested in the documentation, I did:
import urllib
urllib.request.urlretrieve("https://resources.lendingclub.com/LoanStats3a.csv.zip", "/tmp/LoanStats3a.csv.zip")
%sh
unzip /tmp/LoanStats3a.csv.zip
but when I apply unzip again, the command never executes and seems to be blocked in a loop with no output.
Thanks for your help.
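For what it's worth, one common cause of exactly this symptom: the Info-ZIP `unzip` CLI prompts for confirmation ("replace ...?") when the target files already exist, and a `%sh` notebook cell has no stdin to answer it, so the second run appears to hang. `unzip -o` overwrites without asking; Python's `zipfile.extractall` also overwrites silently, as in this self-contained sketch (it builds its own small archive instead of downloading one):

```python
import os
import tempfile
import zipfile

# Self-contained stand-in for the downloaded file: build a small zip.
work = tempfile.mkdtemp()
archive = os.path.join(work, "LoanStats3a.csv.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("LoanStats3a.csv", "id,amount\n1,100\n")

out_dir = os.path.join(work, "extracted")

# zipfile.extractall overwrites existing files silently, so running
# this twice never waits for an overwrite confirmation the way a bare
# `unzip` invocation does.
for _ in range(2):
    with zipfile.ZipFile(archive, "r") as zf:
        zf.extractall(out_dir)

print(os.listdir(out_dir))  # ['LoanStats3a.csv']
```

The same idea in the shell is simply `unzip -o /tmp/LoanStats3a.csv.zip` (or `-n` to never overwrite).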
12-15-2021 05:39 AM
Another problem is that dbfs storage doesn't support random writes (used by zip):
Does not support random writes. For workloads that require random writes, perform the operations on local disk first and then copy the result to /dbfs.
source: https://docs.databricks.com/data/databricks-file-system.html#local-file-api-limitations
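Given that limitation, the pattern the docs suggest looks roughly like the sketch below: do the extraction (which needs random writes) on local disk, then copy whole files to the destination. The helper name and demo paths are mine; on Databricks `dest_dir` would be a `/dbfs/...` path.

```python
import os
import shutil
import tempfile
import zipfile

def extract_then_copy(archive_path, dest_dir):
    """Extract on local disk first, then copy whole files to dest_dir.

    The random writes done by extraction happen on local disk; the
    destination (e.g. /dbfs/... on Databricks) only receives
    sequential whole-file copies.
    """
    local_dir = tempfile.mkdtemp()
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(local_dir)  # random writes happen here, locally
    os.makedirs(dest_dir, exist_ok=True)
    for name in os.listdir(local_dir):
        shutil.copy(os.path.join(local_dir, name), dest_dir)
    shutil.rmtree(local_dir)

# Demo with a throwaway archive.
demo = tempfile.mkdtemp()
archive = os.path.join(demo, "loans.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("loans.csv", "id\n1\n")
dest = os.path.join(demo, "out")
extract_then_copy(archive, dest)
```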
11-02-2021 08:24 AM
Ok.
Note that I have the same behaviour when I'm using the Python API:
import zipfile

with zipfile.ZipFile(path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)
11-02-2021 08:47 AM
If you're going to be reading the files with Spark, you may not need to unzip them: Spark's CSV reader handles compressed CSVs (e.g. gzip) as well as uncompressed ones.
If you're going to be using urlretrieve, remember that it will put the files on the driver and not in DBFS, so you'll have to move them into the distributed filesystem for Spark to read them.
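On a cluster that move is typically a one-liner, `dbutils.fs.mv("file:/tmp/...", "dbfs:/tmp/...")`. Outside Databricks there is no `dbutils`, so this sketch stands in for the two filesystems with temporary directories just to show the shape of the operation:

```python
import os
import shutil
import tempfile

# Stand-ins for the two filesystems: on a real cluster the source is
# the driver's local disk (file:/tmp/...) and the destination is DBFS
# (dbfs:/tmp/..., visible locally as /dbfs/tmp/...).
driver_tmp = tempfile.mkdtemp()
dbfs_tmp = tempfile.mkdtemp()

src = os.path.join(driver_tmp, "LoanStats3a.csv.zip")
with open(src, "wb") as f:
    f.write(b"placeholder bytes")

# Databricks equivalent:
#   dbutils.fs.mv("file:/tmp/LoanStats3a.csv.zip",
#                 "dbfs:/tmp/LoanStats3a.csv.zip")
dst = shutil.move(src, os.path.join(dbfs_tmp, "LoanStats3a.csv.zip"))
```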
11-02-2021 08:54 AM
Actually, I will not be reading the file with Spark at this stage, and I am not using urlretrieve either; that was just for the reproducible example.
Zipped files are ingested on ADLS Gen2 and I unzip them into distinct directories depending on their names. But when I execute my script a second time, I am facing the problem I described above.
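The routing step described here (one output directory per archive, derived from its name) can be sketched as follows. The helper name and layout are assumptions, not the original script; note that `extractall` overwrites silently, so a second run does not block:

```python
import os
import tempfile
import zipfile

def route_and_extract(landing_dir, output_root):
    """Extract each .zip in landing_dir into a directory named after it,
    e.g. landing_dir/loans_2021.zip -> output_root/loans_2021/."""
    for name in sorted(os.listdir(landing_dir)):
        if not name.endswith(".zip"):
            continue
        target = os.path.join(output_root, name[: -len(".zip")])
        os.makedirs(target, exist_ok=True)
        with zipfile.ZipFile(os.path.join(landing_dir, name)) as zf:
            zf.extractall(target)  # overwrites silently on a re-run

# Demo: one archive, extracted twice without blocking.
demo = tempfile.mkdtemp()
landing = os.path.join(demo, "landing")
os.makedirs(landing)
with zipfile.ZipFile(os.path.join(landing, "loans_2021.zip"), "w") as zf:
    zf.writestr("a.csv", "x\n")
out_root = os.path.join(demo, "out")
route_and_extract(landing, out_root)
route_and_extract(landing, out_root)  # second run is idempotent
```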
11-02-2021 08:58 AM
So when you use %sh it's going to use the file system on the driver, which is temporary. The driver storage is the local disk of a VM, not ADLS Gen2.
11-02-2021 09:02 AM
Yes, I understand, but regardless of what I do with the unzipped files, I am asking why there is a problem executing the unzip action twice.
11-02-2021 09:25 AM
The following commands could help:
dbutils.fs.ls("dbfs:/tmp/")
%sh
ls /dbfs/tmp
You can also consider adjusting your script to use the /dbfs prefix, for example:
import urllib
urllib.request.urlretrieve("https://resources.lendingclub.com/LoanStats3a.csv.zip", "/dbfs/tmp/LoanStats3a.csv.zip")
11-02-2021 10:53 AM
I am trying to run :
import zipfile

with zipfile.ZipFile(path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)
but I think I am facing some issues because my zip file is quite large.
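For a large archive, a member-by-member loop can help: `extractall` gives no progress output (so it can look stuck), while extracting one member at a time lets you log progress and stream in fixed-size chunks so memory stays flat. A sketch, with names of my own choosing:

```python
import os
import shutil
import tempfile
import zipfile

def stream_extract(archive_path, dest_dir, chunk_size=1 << 20):
    """Extract one member at a time, streaming in fixed-size chunks,
    logging each member so long runs show progress."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            out_path = os.path.join(dest_dir, info.filename)
            parent = os.path.dirname(out_path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            with zf.open(info) as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
            print("extracted", info.filename)

# Demo on a small archive.
demo = tempfile.mkdtemp()
big_zip = os.path.join(demo, "big.zip")
with zipfile.ZipFile(big_zip, "w") as zf:
    zf.writestr("part/rows.csv", "a,b\n" * 1000)
stream_extract(big_zip, os.path.join(demo, "out"))
```

Per the DBFS limitation quoted above, `dest_dir` should be local disk first, with a copy to /dbfs afterwards.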
11-15-2021 04:23 PM
Hi @Bertrand BURCKER,
Are you still having this issue, or were you able to solve it? Please let us know.
11-29-2021 10:54 PM
Could you please try without Community Edition? There may be some restriction on %sh.
11-16-2021 12:32 AM
No, still not solved.
11-16-2021 10:04 AM
What is your Databricks version: Azure or the free Community Edition?
11-22-2021 02:45 PM
Have you tried the examples from this article? link
11-26-2021 10:26 AM
I would like my code to be fully written in Python, if possible.