Databricks Community

RiyazAliM · ‎01-04-2022

Hey Team!

All I'm trying to do is download a CSV file stored on S3 and read it using Spark.

Here's what I mean:

!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv

If I download this "yellow_tripdata_2020-01.csv", where exactly would it be stored?

The output from wget is below:

--2022-01-04 12:38:48--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.193.8
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.193.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593610736 (566M) [text/csv]
Saving to: ‘yellow_tripdata_2020-01.csv’
 
yellow_tripdata_202 100%[===================>] 566.11M  14.9MB/s    in 42s     
 
2022-01-04 12:39:31 (13.5 MB/s) - ‘yellow_tripdata_2020-01.csv’ saved [593610736/593610736]

Any help would be appreciated.

Tagging @Kaniz Fatma and @Harikrishnan Kunhumveettil for better reach.

Riz

Hubert-Dudek · ‎01-04-2022

I would prefer to use the Python requests library to have full control over the download, and save the file to DBFS storage.
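A minimal sketch of downloading from Python straight to a chosen path. To stay dependency-free it uses the standard library's urllib.request rather than requests, so it is a swapped-in stand-in for the suggestion above, not the exact approach. The function name download_to and the /dbfs/tmp/... target are hypothetical, and writing through the /dbfs mount assumes a cluster where that mount is available.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, dest_path: str) -> int:
    """Stream the resource at `url` into `dest_path` and return its size in bytes."""
    dest = Path(dest_path)
    # Create the target directory if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so a large file is never held fully in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest.stat().st_size


# Hypothetical usage in a notebook (the /dbfs/tmp/... target is an assumption):
# download_to(
#     "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv",
#     "/dbfs/tmp/yellow_tripdata_2020-01.csv",
# )
```

Because the destination is an explicit argument, there is no question about where the file ends up, which is the main advantage over a bare wget call.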

If you use wget, you can run it with the %sh magic command in a notebook cell:

%sh

wget...

You can then check the current working directory, which is where wget saves the file by default, with

%sh

pwd

With wget it is also possible to specify the output file (the -O option); see https://linux.die.net/man/1/wget
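The same check can be sketched from a Python cell. This assumes the Python process and the %sh shell share a working directory, which may not hold on every runtime; resolve_download_path is a hypothetical helper name.

```python
import os


def resolve_download_path(filename: str) -> str:
    """Return the absolute path a relative `filename` maps to in the current working directory."""
    return os.path.abspath(filename)


# Print the working directory, then the full path a relative download would get.
print(os.getcwd())
print(resolve_download_path("yellow_tripdata_2020-01.csv"))
```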

My blog: https://databrickster.medium.com/

RiyazAliM · ‎01-10-2022

Hi @Kaniz Fatma, thanks for the reminder.

Hey @Hubert Dudek - thank you very much for your prompt response.

Initially, I was using urllib3 to GET the data at the URL, and I wanted an alternative. Unfortunately, the requests library does the same thing as urllib3.

The question I had was: if I use the wget command, where does the downloaded data get stored?

I found that it is saved on the driver node's local filesystem (not in memory), in the current working directory.

In my case:

'/databricks/driver'

Once I figured that out, I saved the data in DBFS, as Hubert suggested:

dbutils.fs.cp('file:/databricks/driver/yellow_tripdata_2020-01.csv', 'dbfs:/FileStore/tables/')
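The copy-then-read step could be sketched as a small helper. The name copy_to_dbfs_and_read is hypothetical, the explicit destination file name is an assumption (the command above targets the directory), and header=True / inferSchema=True are assumptions about the CSV rather than anything stated in this thread.

```python
def copy_to_dbfs_and_read(spark, dbutils, local_path: str, dbfs_path: str):
    """Copy a driver-local file to DBFS, then load it as a Spark DataFrame."""
    # The file: scheme tells dbutils the source is on the driver's local disk.
    dbutils.fs.cp(f"file:{local_path}", dbfs_path)
    # header/inferSchema are assumptions about the CSV, not taken from the thread.
    return spark.read.csv(dbfs_path, header=True, inferSchema=True)


# In a Databricks notebook, `spark` and `dbutils` are predefined:
# df = copy_to_dbfs_and_read(
#     spark, dbutils,
#     "/databricks/driver/yellow_tripdata_2020-01.csv",
#     "dbfs:/FileStore/tables/yellow_tripdata_2020-01.csv",
# )
# df.show(5)
```

Reading from the dbfs: path rather than the driver-local one matters because Spark executors generally cannot see files that exist only on the driver's disk.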

Thank y'all for the quick turnaround.

Riz

Where do files downloaded with wget get stored in Databricks?