
Where do files downloaded with wget get stored in Databricks?

RiyazAli
Contributor III

Hey Team!

All I'm trying to do is download a CSV file stored on S3 and read it using Spark.

Here's what I mean:

!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv

If I download "yellow_tripdata_2020-01.csv" this way, where exactly is it stored?

The output from wget is below:

--2022-01-04 12:38:48--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.193.8
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.193.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593610736 (566M) [text/csv]
Saving to: ‘yellow_tripdata_2020-01.csv’
 
yellow_tripdata_202 100%[===================>] 566.11M  14.9MB/s    in 42s     
 
2022-01-04 12:39:31 (13.5 MB/s) - ‘yellow_tripdata_2020-01.csv’ saved [593610736/593610736]

Any help would be appreciated.

Tagging @Kaniz Fatma and @Harikrishnan Kunhumveettil for better reach.

Accepted Solution

Hubert-Dudek
Esteemed Contributor III

I would prefer to use the Python requests library, which gives you full control over the download, and save the file to DBFS storage.

If you use wget, you can run it with the %sh magic command in a notebook cell:

%sh
wget...

By default wget saves the file in the current working directory, which you can check with:

%sh
pwd

With wget it is also possible to specify the output file using the -O option; see https://linux.die.net/man/1/wget
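The Python route mentioned above could look roughly like the sketch below. It is not Hubert's code: to avoid a third-party dependency it uses the standard library's `urllib.request` instead of requests, and the helper name `download_to` is made up for illustration. The `/dbfs/FileStore/tables/` destination in the commented call assumes the cluster exposes DBFS through the `/dbfs` local mount, which may not hold on every cluster type.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, dest: str, chunk_size: int = 1024 * 1024) -> Path:
    """Stream the content at `url` to the local path `dest`; return that path.

    Hypothetical helper for illustration only.
    """
    dest_path = Path(dest)
    # Create the target directory if it does not exist yet.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so a large file (566M here) is never held fully in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path


# Example call, left commented out because it needs network access and,
# by assumption, a cluster where DBFS is mounted at /dbfs:
# download_to(
#     "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv",
#     "/dbfs/FileStore/tables/yellow_tripdata_2020-01.csv",
# )
```

Because the destination is passed explicitly, there is no need to guess which directory the file landed in, which is the main advantage over a bare wget call.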

Kaniz
Community Manager

Hi @Riyaz Ali, does @Hubert Dudek's reply answer your question?

RiyazAli
Contributor III

Hi @Kaniz Fatma, thanks for the reminder.

Hey @Hubert Dudek - thank you very much for your prompt response.

Initially, I was using urllib3 to GET the data at the URL, so I wanted an alternative to it. Unfortunately, the requests library does the same thing as urllib3.

The question I had was: if I use the wget command, where does the downloaded data get stored?

I found out that it is saved on the driver node's local filesystem, in the current working directory.

In my case:

'/databricks/driver'

Once I figured that out, I saved the data to DBFS, as Hubert suggested.

dbutils.fs.cp('file:/databricks/driver/yellow_tripdata_2020-01.csv', 'dbfs:/FileStore/tables/')
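As a rough sketch of how the pieces fit together, the snippet below builds the `file:` source URI used in the copy above from a working directory and a filename. The helper `driver_file_uri` is hypothetical, and `/databricks/driver` is simply the directory reported in this thread; check `pwd` on your own cluster. The `dbutils` and `spark` calls are left as comments because those objects only exist inside a Databricks notebook.

```python
from pathlib import PurePosixPath


def driver_file_uri(filename: str, workdir: str = "/databricks/driver") -> str:
    """Return the file: URI of a file in the driver's working directory.

    Hypothetical helper; the default workdir is the one seen in this thread.
    """
    return "file:" + str(PurePosixPath(workdir) / filename)


src = driver_file_uri("yellow_tripdata_2020-01.csv")
print(src)  # file:/databricks/driver/yellow_tripdata_2020-01.csv

# In a Databricks notebook, the copy and the Spark read would then be:
# dbutils.fs.cp(src, "dbfs:/FileStore/tables/")
# df = spark.read.csv(
#     "dbfs:/FileStore/tables/yellow_tripdata_2020-01.csv", header=True
# )
```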

Thank y'all for the quick turnaround.

Kaniz
Community Manager

Hi @Riyaz Ali, if that solves your query, would you mind marking it as the best answer?
