Databricks Community

RiyazAli · ‎01-04-2022

Hey Team!

All I'm trying to do is download a CSV file stored on S3 and read it using Spark.

Here's what I mean:

!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv

If I download "yellow_tripdata_2020-01.csv" this way, where exactly is it stored?

The output from wget is below:

--2022-01-04 12:38:48--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.193.8
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.193.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593610736 (566M) [text/csv]
Saving to: ‘yellow_tripdata_2020-01.csv’
 
yellow_tripdata_202 100%[===================>] 566.11M  14.9MB/s    in 42s     
 
2022-01-04 12:39:31 (13.5 MB/s) - ‘yellow_tripdata_2020-01.csv’ saved [593610736/593610736]

Any help would be appreciated.

Tagging @Kaniz Fatma and @Harikrishnan Kunhumveettil for better reach.

Riz

Hubert-Dudek · ‎01-04-2022

I would prefer to use the Python requests library, which gives you full control over the download, and save the file to DBFS storage.

If you use wget, you can run it in a notebook cell with the %sh magic command:

%sh

wget...

You can then check the current directory, which is where wget saves the file by default, with:

%sh

pwd

With wget it is also possible to specify the output file (the -O option); see https://linux.die.net/man/1/wget
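A minimal sketch of the Python route, for anyone who wants the destination to be explicit rather than "wherever the shell happens to be". It uses the standard library's urllib.request instead of requests so that it runs without extra packages; the same streaming pattern should work with requests via `stream=True` and `iter_content`. The helper name `download_to` and the `/dbfs/FileStore/tables/` target in the usage comment are assumptions, and the `/dbfs` mount may not be available on every cluster type.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, dest: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream `url` to `dest` in chunks and return the number of bytes written."""
    dest_path = Path(dest)
    # Create the target directory so the destination is explicit,
    # not "whatever the current working directory is".
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # copyfileobj reads chunk_size bytes at a time, so a large file
        # never has to fit in memory all at once.
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path.stat().st_size


# Hypothetical usage in a notebook (the target path is an assumption):
# download_to(
#     "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv",
#     "/dbfs/FileStore/tables/yellow_tripdata_2020-01.csv",
# )
```

Writing straight to a DBFS path skips the separate copy step, at the cost of depending on the `/dbfs` mount being present.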

RiyazAli · ‎01-10-2022

Hi @Kaniz Fatma , thanks for the reminder.

Hey @Hubert Dudek - thank you very much for your prompt response.

Initially, I was using urllib3 to GET the data at the URL, and I was looking for an alternative. Unfortunately, the requests library does the same thing as urllib3.

The question I had was: if I use the wget command, where does the downloaded data get stored?

I found that it is saved on the driver node's local disk, in the shell's current working directory.

In my case:

'/databricks/driver'
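For reference, here is a small sketch of the same check done from Python rather than `%sh pwd`. It assumes the shell that ran wget and the notebook's Python process share a working directory, which may not hold on every runtime; `describe_download` is a hypothetical helper name.

```python
import os


def describe_download(filename: str) -> dict:
    """Report where a relative filename resolves and whether the file exists there."""
    path = os.path.abspath(filename)  # relative names resolve against os.getcwd()
    exists = os.path.isfile(path)
    return {
        "cwd": os.getcwd(),
        "path": path,
        "exists": exists,
        "size_bytes": os.path.getsize(path) if exists else None,
    }


print(describe_download("yellow_tripdata_2020-01.csv"))
```

If "exists" comes back False, the file was probably saved relative to a different directory than the one Python is using.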

Once I figured that out, I saved the data in DBFS, as Hubert suggested.

dbutils.fs.cp('file:/databricks/driver/yellow_tripdata_2020-01.csv', 'dbfs:/FileStore/tables/')
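To close the loop on the original goal of reading the file with Spark, here is a rough sketch that wraps the copy and the read in one function. `publish_and_read` is a hypothetical helper; `dbutils` and `spark` are passed in explicitly because they only exist inside a Databricks notebook, and the `header`/`inferSchema` options are assumptions about the CSV layout.

```python
def publish_and_read(dbutils, spark, local_path: str, dbfs_dir: str):
    """Copy a driver-local file into DBFS and load it as a Spark DataFrame."""
    filename = local_path.rsplit("/", 1)[-1]
    src = f"file:{local_path}"                  # file: scheme = driver's local disk
    dst = f"{dbfs_dir.rstrip('/')}/{filename}"  # explicit target filename
    dbutils.fs.cp(src, dst)
    # header/inferSchema are assumptions about the file; inferSchema
    # costs an extra pass over the data.
    return spark.read.csv(dst, header=True, inferSchema=True)


# Hypothetical usage inside a notebook:
# df = publish_and_read(
#     dbutils, spark,
#     "/databricks/driver/yellow_tripdata_2020-01.csv",
#     "dbfs:/FileStore/tables",
# )
```

Reading from the dbfs: path rather than the file: path matters because the local copy lives only on the driver, while executors on a multi-node cluster need a location they can all reach.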

Thank you all for the quick turnaround.

Riz

Where do files downloaded with wget get stored in Databricks?
