01-04-2022 05:26 AM
Hey Team!
All I'm trying to do is download a CSV file stored on S3 and read it using Spark.
Here's what I mean:
!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
If I download "yellow_tripdata_2020-01.csv" this way, where exactly is it stored?
The response to wget is as below:
--2022-01-04 12:38:48-- https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.193.8
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.193.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593610736 (566M) [text/csv]
Saving to: ‘yellow_tripdata_2020-01.csv’
yellow_tripdata_202 100%[===================>] 566.11M 14.9MB/s in 42s
2022-01-04 12:39:31 (13.5 MB/s) - ‘yellow_tripdata_2020-01.csv’ saved [593610736/593610736]
Any help would be appreciated.
Tagging @Kaniz Fatma, @Harikrishnan Kunhumveettil for better reach.
Labels: Data Ingestion & connectivity
01-04-2022 07:21 AM
I would prefer to use the Python requests library, to have full control over the download, and save the file directly to DBFS storage.
If you use wget, you can run it with the %sh magic command in a notebook cell:
%sh
wget...
You can then check the current directory with:
%sh
pwd
With wget it is also possible to specify the output file (the -O option); see https://linux.die.net/man/1/wget
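As a rough illustration of the "download from Python and save to DBFS" idea above, here is a minimal sketch. It uses the standard-library urllib.request instead of requests so that it needs no extra packages; requests.get(url, stream=True) would follow the same pattern. The helper name download_to and the /dbfs/... destination in the usage comment (which assumes the DBFS FUSE mount is available on the cluster) are assumptions, not something confirmed in this thread.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, dest: str) -> int:
    """Stream `url` into the file `dest` and return the number of bytes written.

    `download_to` is a hypothetical helper name used only for this sketch.
    """
    dest_path = Path(dest)
    # Create the target directory if it does not exist yet.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Copy in 1 MiB chunks so a large CSV is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    return dest_path.stat().st_size


# Possible usage in a notebook cell (destination path is an assumption):
# download_to(
#     "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv",
#     "/dbfs/FileStore/tables/yellow_tripdata_2020-01.csv",
# )
```

Writing to an explicit destination path avoids having to guess which working directory the download landed in.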
My blog: https://databrickster.medium.com/
01-10-2022 10:56 PM
Hi @Kaniz Fatma, thanks for the reminder.
Hey @Hubert Dudek - thank you very much for your prompt response.
Initially, I was using urllib3 to 'GET' the data at the URL, so I was looking for an alternative to that. Unfortunately, the requests library does the same thing as urllib3.
My question was: if I use the wget command, where does the downloaded data get stored?
I understood that it would be saved on the driver node's local disk (not in DBFS).
In my case:
'/databricks/driver'
Once I figured that out, as Hubert suggested, I saved the data in DBFS.
dbutils.fs.cp('file:/databricks/driver/yellow_tripdata_2020-01.csv', 'dbfs:/FileStore/tables/')
Thank y'all for the quick turnaround.
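For completeness, the remaining step from the original question (reading the file with Spark) might look roughly like the sketch below. It assumes the copy above leaves the file at dbfs:/FileStore/tables/yellow_tripdata_2020-01.csv, that the CSV has a header row, and that it runs in a notebook where a SparkSession named spark already exists. The helper name read_trip_csv is hypothetical.

```python
def read_trip_csv(spark, path: str):
    """Read a CSV file into a Spark DataFrame.

    `spark` is expected to be an existing SparkSession (Databricks notebooks
    normally provide one). The header=True setting is an assumption about
    the file's layout.
    """
    # inferSchema=True makes Spark scan the data to guess column types;
    # for a file of this size an explicit schema would likely be faster.
    return spark.read.csv(path, header=True, inferSchema=True)


# Possible usage in a notebook cell (path is an assumption):
# df = read_trip_csv(spark, "dbfs:/FileStore/tables/yellow_tripdata_2020-01.csv")
# df.printSchema()
```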