Merge 12 CSV files in Databricks.

AleksandraFrolo
New Contributor III

Hello everybody,

I am absolutely new to Databricks, so I need your help.

Details:

Task: merge 12 CSV files in Databricks in the best possible way.

Location of files: I will describe it in detail, because I cannot orient myself well yet. If I go to Data -> Browse DBFS, I can find the folder with my 12 CSV files.

What I have tried:

First I need to say that I've reached the correct result, but I think it was a really bad approach.

  • Create a Spark session object. The object is used to read data from the CSV files.
spark = SparkSession.builder.getOrCreate()

  • Save each CSV read into its own variable.
df_April = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_April_2019.csv")
df_August = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_August_2019.csv")
df_December = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_December_2019.csv")
df_February = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_February_2019.csv")
df_January = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_January_2019.csv")
df_July = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_July_2019.csv")
df_June = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_June_2019.csv")
df_March = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_March_2019.csv")
df_May = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_May_2019.csv")
df_November = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_November_2019.csv")
df_October = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_October_2019.csv")
df_September = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_September_2019.csv")

  • Use the union() method to combine the data. The data structure (schema) must be the same for this method. union() returns a new DataFrame containing all rows from the original DataFrame and all rows from the specified DataFrame (duplicates are kept, unlike a set union).
df_AllMonth = df_April.union(df_August).union(df_December).union(df_February).union(df_January).union(df_July).union(df_June).union(df_March).union(df_May).union(df_November).union(df_October).union(df_September)
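
As an aside, a long chain like this can also be folded with functools.reduce over a list of the DataFrames. A minimal sketch, assuming the twelve variables read above:

from functools import reduce

# The twelve monthly DataFrames read above, collected in a list
monthly_dfs = [df_January, df_February, df_March, df_April,
               df_May, df_June, df_July, df_August,
               df_September, df_October, df_November, df_December]

# Fold the list into one DataFrame with pairwise union()
df_AllMonth = reduce(lambda left, right: left.union(right), monthly_dfs)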

Conclusion:

I want to find an approach where I can merge the data without saving it into separate variables. Is that possible? Maybe you can find a better way to do this task?

Thank you!


5 REPLIES

-werners-
Esteemed Contributor III

Ok some tips:

  • you do not have to create a Spark session on Databricks; it is already created for you. But your getOrCreate does not break anything.

Spark can read a whole folder at once. If you have 12 CSV files in one folder, AND THEY HAVE THE SAME SCHEMA, you can try:

spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/")

That way, the whole folder is read.

Of course, if your files have different structures, Spark obviously does not know what to do, so you will have to define a schema manually (see the sketch below).
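
For instance, a manual schema might look like this. A minimal sketch, where the column names and types are only placeholders for whatever the sales CSVs actually contain:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical columns -- replace with the real CSV headers and types
sales_schema = StructType([
    StructField("Order ID", StringType(), True),
    StructField("Product", StringType(), True),
    StructField("Price Each", DoubleType(), True),
])

df = (spark.read.format("csv")
      .option("header", "true")
      .schema(sales_schema)  # enforce these types instead of inferring them
      .load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"))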

Hi, thank you for your answer! Yeah, the structures of all my CSV files are the same.

I used the listdir() method to get all the file names, and with a for loop I read each path and CSV file and save it into a new DataFrame.

Important: Actually, if I write "dbfs:/..." it doesn't work (I always get an error like "file not found"), but when I use "/dbfs/" it works, I don't know why 😥

Anyway, this is the correct code to read all the CSV files and concatenate them:

import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

files = os.listdir(folder_path)  # returns a list of all file names in the folder

df_all_months = pd.DataFrame()  # empty DataFrame to collect all months

for file in files:
    df_of_single_file = pd.read_csv(folder_path + file)  # read the current file
    df_all_months = pd.concat([df_all_months, df_of_single_file])
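
As a side note on the pandas version: concatenating inside the loop copies the accumulated data on every pass; the usual idiom is to collect the frames in a list and call pd.concat once. A minimal sketch of that variant:

import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

# Read every file into a list, then concatenate once at the end
frames = [pd.read_csv(folder_path + name) for name in os.listdir(folder_path)]
df_all_months = pd.concat(frames, ignore_index=True)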

-werners-
Esteemed Contributor III

Hi,

This solution will indeed work, but it is far from optimal.

You can read a whole folder at once instead of using a loop.

The difference between dbfs:/ and /dbfs/ is just the type of file interface.

dbfs:/ is the URI scheme that Spark uses, while /dbfs/ is the local FUSE mount used by local file APIs such as os and pandas, which is why /dbfs/ works in your pandas code and dbfs:/ works with spark.read.
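
To make the distinction concrete, a minimal sketch contrasting the two interfaces on the same folder:

import os

rel_path = "FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

# Spark addresses DBFS through the dbfs:/ URI scheme
df = spark.read.format("csv").option("header", "true").load("dbfs:/" + rel_path)

# Local file APIs (os, pandas, open) go through the /dbfs/ FUSE mount
print(os.listdir("/dbfs/" + rel_path))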

Lakshay
Esteemed Contributor

It seems that all your CSV files are present under one folder, and since you are able to union them, all these files must have the same schema as well.

Given the above conditions, you can simply read all the data by referencing the folder instead of each file individually.
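
As a small variant, glob patterns are also accepted in the load path, so the read could be restricted to just the monthly files. A sketch assuming the file naming shown in the question:

# A glob pattern limits the folder read to matching files only
df_all = (spark.read.format("csv")
          .option("header", "true")
          .load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_*_2019.csv"))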

Hello, thank you for the answer! Yeah, that is true: the schema of all my CSV files is the same and they are all located in one folder. I posted a solution above your message.
