Data Engineering

Best way of loading several csv files in a table

CleverAnjos
New Contributor III

What would be the best way of loading several files like these into a single table to be consumed?

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv
https://s3.amazonaws.com/nyc-t...

1 ACCEPTED SOLUTION

Accepted Solutions

Yes,

1) Downloaded the files using sh from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_<year>-<month>.csv to /mnt

2) Loaded a dataframe from the csv files

3) Stored it as a partitioned table

I don't know if this is the best approach, but it's working.
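The three steps above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the download uses Python's urllib instead of sh, the partition columns (year and month derived from `tpep_pickup_datetime`) are an assumption since the post does not say how the table was partitioned, and the directory and table name in the usage note are hypothetical.

```python
import os
import urllib.request

# URL pattern taken from the post; year and month are filled in per file.
BASE_URL = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_{year}-{month:02d}.csv"


def monthly_urls(first_year, last_year):
    """Build one URL per month for an inclusive range of years."""
    return [
        BASE_URL.format(year=year, month=month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


def download_all(urls, target_dir):
    """Step 1: download each file into target_dir, skipping files already there."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        dest = os.path.join(target_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(dest):
            urllib.request.urlretrieve(url, dest)


def load_and_store(spark, source_dir, table_name):
    """Steps 2 and 3: read all CSVs in source_dir and save a partitioned table."""
    # Imported here so the helpers above work without pyspark installed.
    from pyspark.sql import functions as F

    # Pointing the reader at a directory loads every CSV file inside it.
    df = spark.read.csv(source_dir, header=True, inferSchema=True)

    # Assumption: partition by pickup year and month. The column name
    # tpep_pickup_datetime may differ in older files.
    df = (
        df.withColumn("year", F.year("tpep_pickup_datetime"))
          .withColumn("month", F.month("tpep_pickup_datetime"))
    )
    df.write.mode("overwrite").partitionBy("year", "month").saveAsTable(table_name)
```

In a notebook this might be called as `download_all(monthly_urls(2019, 2019), "/dbfs/mnt/nyc_taxi")` followed by `load_and_store(spark, "dbfs:/mnt/nyc_taxi", "yellow_trips")`. Note that `inferSchema=True` makes Spark scan the files an extra time, so passing an explicit schema is probably worth it for data this large.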


8 REPLIES

Kaniz_Fatma
Community Manager

Hi @Clever Anjos, here's how you can load multiple files in parallel in Azure Data Factory.

https://www.mssqltips.com/sqlservertip/6281/how-to-load-multiple-files-in-parallel-in-azure-data-fac...

Hubert-Dudek
Esteemed Contributor III

The New York taxi data from your example is already included in your workspace as a demo dataset.

It is enough to read the "yellow" folder; Spark will read all the CSVs in it.

If you want to save it as a single file, you can use .repartition(1).write.csv(destination_folder). Spark still writes a folder, but it will contain a single part file.
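A minimal sketch of the two calls described above. The dataset path is an assumption about where the demo data lives in a workspace, so it is worth checking before relying on it, and the function names are made up for illustration.

```python
# Assumed location of the demo data; verify it in your own workspace.
YELLOW_DIR = "dbfs:/databricks-datasets/nyctaxi/tripdata/yellow/"


def read_folder(spark, folder=YELLOW_DIR):
    """Read every CSV file in the folder into one DataFrame."""
    return spark.read.csv(folder, header=True)


def write_single_csv(df, destination_folder):
    """Write the DataFrame as a single part file inside destination_folder."""
    # repartition(1) moves all rows into one partition, so exactly one
    # part file is written. write.csv() performs the write itself.
    df.repartition(1).write.csv(destination_folder, header=True, mode="overwrite")
```

With the `spark` session a notebook provides, `write_single_csv(read_folder(spark), destination_folder)` would chain the two. Forcing a single partition sends all the data through one task, which can be slow for a dataset of this size.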


Great!

Unfortunately, it seems that the nyctaxi dataset is outdated. There are no records for 2021, almost none for 2020, and 2019 is incomplete:

+-----------+------------------+
|       2010|         169001154|
|       2011|         176897208|
|       2015|         146112990|
|       2014|         165114361|
|       2013|         173179759|
|       2012|         178544324|
|       2009|         170896987|
|       2016|         131165043|
|       2017|         113496933|
|       2018|         102803387|
|       2041|                 3|
|       2008|               585|
|       2001|                15|
|       2029|                 6|
|       2002|                33|
|       2053|                 2|
|       2003|                23|
|       2020|               438|
|       2019|          84397753|
|       2037|                 1|
+-----------+------------------+

Hi @Clever Anjos​ , You may find the NYC TLC trip records here.

CleverAnjos
New Contributor III

Thanks Kaniz, I already have the files. I was asking about the best way to load them.

Hi @Clever Anjos​ , Were you able to load 'em?

Yes,

1) Downloaded the files using sh from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_<year>-<month>.csv to /mnt

2) Loaded a dataframe from the csv files

3) Stored it as a partitioned table

I don't know if this is the best approach, but it's working.
