Databricks

CleverAnjos · ‎01-17-2022

What would be the best way of loading several files like in a single table to be consumed?

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csvhttps://s3.amazonaws.com/nyc-t...

CleverAnjos · ‎01-31-2022

Yes,

1) Downloaded the files using sh from here https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_<year>-<month>.csv to /mnt

2) Loaded a dataframe with the csv files

3) Stored as a partitioned table

I don´t know if this the best approach, but its working

View solution in original post

Kaniz · ‎01-18-2022

Hi @Clever Anjos , Here's how you can load several multiple files into Azure Data Factory.

https://www.mssqltips.com/sqlservertip/6281/how-to-load-multiple-files-in-parallel-in-azure-data-fac...

Hubert-Dudek · ‎01-18-2022

New Your Taxi data from your example is already included in your workspace as it is demo dataset.

It is enough to read "yellow" folder and it will read all csvs from there.

If you want to save it as a single file you can do .repartition(1).write.csv(destination_folder).save()

CleverAnjos · ‎01-18-2022

Great!

CleverAnjos · ‎01-19-2022

Unfortunately it seems that nytaxi is outdated. there is no records from 2021 and 2020 and 2019 is barely uncomplete

+-----------+------------------+

| 2010| 169001154|

| 2011| 176897208|

| 2015| 146112990|

| 2014| 165114361|

| 2013| 173179759|

| 2012| 178544324|

| 2009| 170896987|

| 2016| 131165043|

| 2017| 113496933|

| 2018| 102803387|

| 2041| 3|

| 2008| 585|

| 2001| 15|

| 2029| 6|

| 2002| 33|

| 2053| 2|

| 2003| 23|

| 2020| 438|

| 2019| 84397753|

| 2037| 1|

+-----------+------------------+

Kaniz · ‎01-31-2022

Hi @Clever Anjos , You may find the NYC TLC trip records here.

CleverAnjos · ‎01-31-2022

Thanks Kaniz, I already have the files. I was discussing about the best way to load them

Kaniz · ‎01-31-2022

Hi @Clever Anjos , Were you able to load 'em?

CleverAnjos · ‎01-31-2022

Yes,

1) Downloaded the files using sh from here https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_<year>-<month>.csv to /mnt

2) Loaded a dataframe with the csv files

3) Stored as a partitioned table

I don´t know if this the best approach, but its working

Databricks

Best way of loading several csv files in a table

How to successfully build GenAI applications

Registration now open! Databricks Data + AI Summit 2024

Meet DBRX, the New Standard for High-Quality LLMs

Register now and save 50% on training at Data + AI Summit!