
Merge 12 CSV files in Databricks.

AleksandraFrolo
New Contributor III

Hello everybody,

I am absolutely new to Databricks, so I need your help.

Details:

Task: merge 12 CSV files in Databricks in the best way possible.

Location of files: I will describe it in detail, because I can't orient myself well yet. If I go to Data -> Browse DBFS, I can find the folder with my 12 CSV files.

What I have tried:

First I should say that I've reached the correct result, but I think my approach was really bad.

  • Create a SparkSession object, which is the entry point for reading the CSV files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

  • Read each CSV file into its own variable.
df_April = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_April_2019.csv")
df_August = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_August_2019.csv")
df_December = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_December_2019.csv")
df_February = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_February_2019.csv")
df_January = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_January_2019.csv")
df_July = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_July_2019.csv")
df_June = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_June_2019.csv")
df_March = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_March_2019.csv")
df_May = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_May_2019.csv")
df_November = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_November_2019.csv")
df_October = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_October_2019.csv")
df_September = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_September_2019.csv")

  • Use the union() method to combine the data. The schemas must be the same for this to work: union() returns a new DataFrame containing all rows from both DataFrames (duplicates are kept, and columns are matched by position, not by name).
df_AllMonth = df_April.union(df_August).union(df_December).union(df_February).union(df_January).union(df_July).union(df_June).union(df_March).union(df_May).union(df_November).union(df_October).union(df_September)
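
As a side note, the same chained union can be written more compactly by folding union() over a list, so no intermediate result variables are needed. A minimal sketch, assuming the twelve monthly DataFrames defined above:

from functools import reduce

# Collect the twelve monthly DataFrames into a list.
monthly_dfs = [df_January, df_February, df_March, df_April, df_May, df_June,
               df_July, df_August, df_September, df_October, df_November, df_December]

# Fold union() over the list: equivalent to the long chained call above.
df_AllMonth = reduce(lambda left, right: left.union(right), monthly_dfs)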

Conclusion:

I want to find an approach where I can merge the data without saving it into variables first. Is that possible? Maybe you can suggest a better way to do this task?

Thank you!


5 REPLIES

-werners-
Esteemed Contributor III

Ok, some tips:

  • You do not have to create a Spark session on Databricks; it is already created for you. Your getOrCreate() does not break anything, though.

  • Spark can read a whole folder at once. If you have 12 CSV files in one folder, AND THEY HAVE THE SAME SCHEMA, you can try:

spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/")

That way the whole folder is read in one go.

Of course, if your files have different structures, Spark does not know what to do, so you will have to define a schema manually.
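
A minimal sketch of defining a schema manually; the column names and types below are assumptions for illustration, not the actual Sales columns:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema -- replace the fields with the actual CSV columns.
sales_schema = StructType([
    StructField("Order ID", StringType(), True),
    StructField("Product", StringType(), True),
    StructField("Price Each", DoubleType(), True),
])

# Passing a schema skips inference and enforces the declared types.
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(sales_schema)
      .load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"))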

AleksandraFrolo
New Contributor III
Accepted solution

Hi, thank you for your answer! Yes, the structure of all my CSV files is the same.

I used the listdir() method to get all the file names, and with a for loop I read each path and CSV file and save it into a new DataFrame.

Important: Actually, if I write "dbfs:/...." it doesn't work (I always get an error like "file not found"), but when I use "/dbfs/" it works, I don't know why 😥

Anyway, this is the code that reads all the CSV files and concatenates them:

import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

files = os.listdir(folder_path)  # returns the names of all files in the folder

df_all_months = pd.DataFrame()  # start with an empty DataFrame

for file in files:
    df_of_single_file = pd.read_csv(folder_path + file)  # read the current file
    df_all_months = pd.concat([df_all_months, df_of_single_file])
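
One small refinement worth noting (a sketch under the same assumptions as the code above): concatenating inside the loop re-copies the accumulated frame on every iteration, so it is usually cheaper to collect the pieces in a list and call concat() once:

import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

# Read every file in the folder into a list, then concatenate in one call.
frames = [pd.read_csv(folder_path + f) for f in os.listdir(folder_path)]
df_all_months = pd.concat(frames, ignore_index=True)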

-werners-
Esteemed Contributor III

Hi,

this solution will indeed work, but it is far from optimal.

You can read a whole folder at once instead of using a loop.

The difference between dbfs:/ and /dbfs/ is just the file interface being used.

dbfs:/ is the URI scheme the Spark APIs use, while /dbfs/ is the local FUSE mount that non-Spark tools like os and pandas use. That is why /dbfs/ works in your pandas code.
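
A minimal sketch contrasting the two interfaces on the same file (the path is taken from this thread; spark is the session Databricks pre-creates):

import pandas as pd

path = "FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_April_2019.csv"

# Spark APIs address DBFS through the dbfs:/ URI scheme.
df_spark = spark.read.option("header", "true").csv("dbfs:/" + path)

# Local file APIs (pandas, open, os) go through the /dbfs/ FUSE mount.
df_pandas = pd.read_csv("/dbfs/" + path)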

Lakshay
Databricks Employee

It seems that all your CSV files are present under one folder, and since you are able to union them, all these files must have the same schema as well.

Given those conditions, you can simply read all the data by referencing the folder name instead of each file individually.
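
For instance, a short sketch assuming the folder layout from this thread; the glob pattern limits the read to the 2019 sales files in case the folder contains anything else:

# One call reads every matching file in the folder into a single DataFrame.
df_all = (spark.read
          .option("header", "true")
          .csv("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_*_2019.csv"))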

Hello, thank you for the answer! Yes, that is true: the schema of all my CSV files is the same and they are all located in one folder. I posted a solution above your message.
