Where does the files downloaded from wget get stor...

RiyazAliM · ‎01-04-2022

Hey Team!

All I'm trying is to download a csv file stored on S3 and read it using Spark.

Here's what I mean:

!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv

If i download this "yellow_tripdata_2020-01.csv" where exactly it would be stored?

The response to wget is as below:

--2022-01-04 12:38:48--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.193.8
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.193.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593610736 (566M) [text/csv]
Saving to: ‘yellow_tripdata_2020-01.csv’
 
yellow_tripdata_202 100%[===================>] 566.11M  14.9MB/s    in 42s     
 
2022-01-04 12:39:31 (13.5 MB/s) - ‘yellow_tripdata_2020-01.csv’ saved [593610736/593610736]

Any help would be appreciated.

Tagging

@Kaniz Fatma , @Harikrishnan Kunhumveettil for better reach.

Riz

Hubert-Dudek · ‎01-04-2022

I would prefer to use python requests library to have total control and save it to dbfs storage.

If you run wget you can run with magic command in notebook cell:

%sh

wget...

so you can check current directory with

%sh

pwd

regarding wget it is also possible to specify output file https://linux.die.net/man/1/wget

My blog: https://databrickster.medium.com/

View solution in original post

RiyazAliM · ‎01-10-2022

Hi @Kaniz Fatma , thanks for the remainder.

Hey @Hubert Dudek - thank you very much for your prompt response.

Initially, I was using urllib3 to 'GET' the data residing in the URL. So, I wanted an alternative for the same. Unfortunately, requests library does the same thing as urllib3.

The question I had was if I use the wget command, where does the downloaded data gets stored ?

I understood that it would be saved in the driver's memory.

In my case :

'/databricks/driver'

Once, I figured that out, as Hubert suggested, I saved the data in DBFS.

dbutils.fs.cp('file:/databricks/driver/yellow_tripdata_2020-01.csv', 'dbfs:/FileStore/tables/')

Thank y'all for the quick turn around.

Riz

Where does the files downloaded from wget get stored in Databricks?