Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Where do files downloaded with wget get stored in Databricks?

RiyazAli
Valued Contributor

Hey Team!

All I'm trying to do is download a CSV file stored on S3 and read it using Spark.

Here's what I mean:

!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv

If I download "yellow_tripdata_2020-01.csv" this way, where exactly is it stored?

The output from wget is below:

--2022-01-04 12:38:48--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.193.8
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.193.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593610736 (566M) [text/csv]
Saving to: ‘yellow_tripdata_2020-01.csv’
 
yellow_tripdata_202 100%[===================>] 566.11M  14.9MB/s    in 42s     
 
2022-01-04 12:39:31 (13.5 MB/s) - ‘yellow_tripdata_2020-01.csv’ saved [593610736/593610736]

Any help would be appreciated.

Tagging @Kaniz Fatma and @Harikrishnan Kunhumveettil for better reach.

Accepted Solution

Hubert-Dudek
Esteemed Contributor III

I would prefer to use the Python requests library, which gives you full control over the download, and save the file to DBFS storage.

If you use wget, you can run it with the %sh magic command in a notebook cell:

%sh

wget...

You can then check the current working directory, which is where wget saves the file by default, with:

%sh

pwd

wget can also write to a specific output file (the -O option); see https://linux.die.net/man/1/wget
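As a rough illustration of the "download from Python and choose the destination yourself" idea, here is a minimal sketch. It uses the standard library's urllib.request in place of requests so that it has no third-party dependency; the function name download_file is hypothetical, and the /dbfs/... path in the comment assumes the cluster exposes DBFS through the /dbfs local mount.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest_path: str) -> int:
    """Stream the resource at `url` to `dest_path` and return the bytes written.

    `download_file` is an illustrative helper, not part of any Databricks API.
    """
    dest = Path(dest_path)
    # Create the target directory if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in 1 MiB chunks so a large file is never held fully in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    return dest.stat().st_size


# Possible use in a notebook (assumes the /dbfs mount is available):
# download_file(
#     "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv",
#     "/dbfs/FileStore/tables/yellow_tripdata_2020-01.csv",
# )
```

Because the destination is passed in explicitly, there is no need to guess which working directory the file ended up in.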

RiyazAli
Valued Contributor

Hi @Kaniz Fatma, thanks for the reminder.

Hey @Hubert Dudek, thank you very much for your prompt response.

Initially, I was using urllib3 to GET the data at the URL, so I wanted an alternative. Unfortunately, the requests library does the same thing as urllib3.

My question was: if I use the wget command, where does the downloaded data get stored?

I found that it is saved on the driver node's local filesystem, in the current working directory.

In my case:

'/databricks/driver'

Once I figured that out, I saved the data to DBFS, as Hubert suggested.

dbutils.fs.cp('file:/databricks/driver/yellow_tripdata_2020-01.csv', 'dbfs:/FileStore/tables/')
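The copy step above could be wrapped in a small helper so the source and destination URIs are built in one place. This is only a sketch: copy_download_to_dbfs is a hypothetical name, the cp argument stands in for dbutils.fs.cp (which exists only inside a Databricks notebook), and the default local directory is simply the one reported in this post, so it may differ on other clusters.

```python
import posixpath


def copy_download_to_dbfs(filename, cp,
                          local_dir="/databricks/driver",
                          dbfs_dir="dbfs:/FileStore/tables"):
    """Copy a file from the driver's local disk to DBFS using `cp`.

    `cp` is any callable taking (source_uri, destination_uri); in a
    notebook that would be dbutils.fs.cp. Returns the two URIs used.
    """
    # The "file:" scheme tells dbutils to read from the driver's local disk.
    src = "file:" + posixpath.join(local_dir, filename)
    dst = dbfs_dir.rstrip("/") + "/" + filename
    cp(src, dst)
    return src, dst


# Possible use in a notebook, followed by the Spark read from the question:
# src, dst = copy_download_to_dbfs("yellow_tripdata_2020-01.csv", dbutils.fs.cp)
# df = spark.read.csv(dst, header=True, inferSchema=True)
```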

Thank y'all for the quick turnaround.
