Best way of loading several csv files in a table

CleverAnjos
New Contributor III

What would be the best way to load several files like these into a single table for consumption?

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv
https://s3.amazonaws.com/nyc-t...

1 ACCEPTED SOLUTION

Accepted Solutions

CleverAnjos
New Contributor III

Yes,

1) Downloaded the files with sh from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_<year>-<month>.csv to /mnt

2) Loaded the CSV files into a dataframe

3) Stored the dataframe as a partitioned table

I don't know if this is the best approach, but it's working.
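The three steps above could be sketched roughly as follows in PySpark. This is only an illustration, not the poster's actual code: the helper names, the destination directory, the table name, the zero-padded month in the URL, and the `tpep_pickup_datetime` column used to derive the partition columns are all assumptions, and the S3 URLs from the post may no longer be served.

```python
import os
import urllib.request

# URL pattern taken from the post; the zero-padded month is an assumption.
BASE_URL = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_{year}-{month:02d}.csv"


def monthly_urls(first_year, last_year):
    """Build one URL per month for every year in the inclusive range."""
    return [
        BASE_URL.format(year=year, month=month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


def download_all(urls, dest_dir):
    """Step 1: download each file once into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in urls:
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(target):  # skip files that are already there
            urllib.request.urlretrieve(url, target)
        paths.append(target)
    return paths


def load_and_store(spark, source_path, table_name,
                   timestamp_col="tpep_pickup_datetime"):
    """Steps 2 and 3: read all CSVs under source_path, store a partitioned table.

    timestamp_col is an assumed column name; older files may use a different one.
    """
    from pyspark.sql import functions as F  # imported lazily, needs pyspark

    df = spark.read.csv(source_path, header=True, inferSchema=True)
    ts = F.to_timestamp(F.col(timestamp_col))
    df = df.withColumn("year", F.year(ts)).withColumn("month", F.month(ts))
    df.write.partitionBy("year", "month").mode("overwrite").saveAsTable(table_name)


# Hypothetical usage in a notebook where `spark` already exists:
# download_all(monthly_urls(2019, 2020), "/dbfs/mnt/nyctaxi/yellow")
# load_and_store(spark, "/mnt/nyctaxi/yellow/*.csv", "yellow_trips")
```

Passing a folder or a glob to `spark.read.csv` reads every matching file into one dataframe, so no per-file loop is needed for the load step.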


8 REPLIES

Kaniz
Community Manager

Hi @Clever Anjos, here's how you can load multiple files in parallel in Azure Data Factory:

https://www.mssqltips.com/sqlservertip/6281/how-to-load-multiple-files-in-parallel-in-azure-data-fac...

Hubert-Dudek
Esteemed Contributor III

The New York taxi data from your example is already included in your workspace, since it is a demo dataset.

It is enough to read the "yellow" folder; Spark will read all the CSVs in it.

If you want to save it as a single file, you can use .repartition(1).write.csv(destination_folder). Spark still writes a folder, but it will contain a single part file.
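As a rough sketch of that suggestion in PySpark: the function names are made up for illustration, and the demo dataset path in the usage comment is an assumption about where the "yellow" folder lives in the workspace.

```python
def read_csv_folder(spark, folder):
    """Read every CSV file under `folder` into a single dataframe."""
    return spark.read.csv(folder, header=True, inferSchema=True)


def write_single_csv(df, destination_folder):
    """Collapse the dataframe to one partition so Spark emits one part file.

    Note that write.csv() performs the write itself; there is no
    trailing .save() call to make.
    """
    df.repartition(1).write.csv(destination_folder, header=True, mode="overwrite")


# Hypothetical usage in a notebook where `spark` already exists
# (the dataset path is an assumption):
# trips = read_csv_folder(spark, "/databricks-datasets/nyctaxi/tripdata/yellow/")
# write_single_csv(trips, "/mnt/output/yellow_single")
```

Collapsing to one partition funnels all rows through a single task, so for a dataset of this size it may be very slow; keeping the default partitioning is usually the safer choice unless a single file is really required.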


Great!

Unfortunately, it seems that nytaxi is outdated. There are no records from 2021, practically none from 2020, and 2019 is incomplete.

+-----------+------------------+
|       2010|         169001154|
|       2011|         176897208|
|       2015|         146112990|
|       2014|         165114361|
|       2013|         173179759|
|       2012|         178544324|
|       2009|         170896987|
|       2016|         131165043|
|       2017|         113496933|
|       2018|         102803387|
|       2041|                 3|
|       2008|               585|
|       2001|                15|
|       2029|                 6|
|       2002|                33|
|       2053|                 2|
|       2003|                23|
|       2020|               438|
|       2019|          84397753|
|       2037|                 1|
+-----------+------------------+
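A table like the one above could come from a per-year count, and the stray years (2001, 2041, 2053 and so on) can then be separated from the populated ones. The sketch below is an assumption about how that might be done: the `tpep_pickup_datetime` column name, the helper names, and the one-million-row threshold are illustrative, while the counts are copied from the output above.

```python
def trips_per_year(df, timestamp_col="tpep_pickup_datetime"):
    """Count rows per pickup year (timestamp_col is an assumed column name)."""
    from pyspark.sql import functions as F  # imported lazily, needs pyspark

    return df.groupBy(F.year(F.col(timestamp_col)).alias("pickup_year")).count()


# Counts copied from the output posted above.
COUNTS = {
    2010: 169001154, 2011: 176897208, 2015: 146112990, 2014: 165114361,
    2013: 173179759, 2012: 178544324, 2009: 170896987, 2016: 131165043,
    2017: 113496933, 2018: 102803387, 2041: 3, 2008: 585, 2001: 15,
    2029: 6, 2002: 33, 2053: 2, 2003: 23, 2020: 438, 2019: 84397753,
    2037: 1,
}


def split_by_volume(counts, min_rows):
    """Split years into populated ones and stray ones with fewer than min_rows."""
    populated = {y: n for y, n in sorted(counts.items()) if n >= min_rows}
    stray = {y: n for y, n in sorted(counts.items()) if n < min_rows}
    return populated, stray


# The threshold is arbitrary; any value between 585 and ~84 million
# gives the same split for these counts.
populated, stray = split_by_volume(COUNTS, min_rows=1_000_000)
print("populated years:", list(populated))
print("stray years:", list(stray))
```

With this split, 2009 through 2019 count as populated, while 2020 lands among the stray years, which matches the observation that the demo dataset effectively stops in 2019.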

Kaniz
Community Manager

Hi @Clever Anjos, you may find the NYC TLC trip records here.

CleverAnjos
New Contributor III

Thanks Kaniz, I already have the files. I was asking about the best way to load them.

Kaniz
Community Manager

Hi @Clever Anjos, were you able to load 'em?

