Merge 12 CSV files in Databricks.

AleksandraFrolo
New Contributor III

Hello everybody,

I am absolutely new to Databricks, so I need your help.

Details:

Task: merge 12 CSV files in Databricks in the best possible way.

Location of files: I will describe it in detail, because I cannot orient myself well yet. If I go to Data -> Browse DBFS, I can find the folder with my 12 CSV files.

What I have tried:

First I need to say that I've reached the correct result, but I think it was a really bad approach.

  • Create a Spark session object. The object is used to read data from the CSV files.
spark = SparkSession.builder.getOrCreate()

  • Save each CSV read into its own variable.
df_April = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_April_2019.csv")
df_August = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_August_2019.csv")
df_December = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_December_2019.csv")
df_February = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_February_2019.csv")
df_January = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_January_2019.csv")
df_July = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_July_2019.csv")
df_June = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_June_2019.csv")
df_March = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_March_2019.csv")
df_May = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_May_2019.csv")
df_November = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_November_2019.csv")
df_October = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_October_2019.csv")
df_September = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_September_2019.csv")

  • Use the union() method to combine the data. The data structure (schema) must be the same for this method. union() returns a new DataFrame containing all rows from the original DataFrame and all rows from the specified DataFrame (duplicates are kept, unlike a set union).
df_AllMonth = df_April.union(df_August).union(df_December).union(df_February).union(df_January).union(df_July).union(df_June).union(df_March).union(df_May).union(df_November).union(df_October).union(df_September)
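
As an aside, a long chain like this can also be folded with functools.reduce over a list of the DataFrames. A minimal sketch, assuming the twelve variables read above:

from functools import reduce

# The twelve monthly DataFrames read above, collected in a list
monthly_dfs = [df_January, df_February, df_March, df_April,
               df_May, df_June, df_July, df_August,
               df_September, df_October, df_November, df_December]

# Fold the list into one DataFrame with pairwise union()
df_AllMonth = reduce(lambda left, right: left.union(right), monthly_dfs)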

Conclusion:

I want to find an approach where I can merge the data without saving it into separate variables. Is that possible? Maybe you can find a better way to do this task?

Thank you!


5 REPLIES

-werners-
Esteemed Contributor III

Ok some tips:

  • you do not have to create a Spark session on Databricks; it is already created for you. But your getOrCreate does not break anything.

Spark can read a whole folder at once. If you have 12 CSV files in one folder, AND THEY HAVE THE SAME SCHEMA, you can try:

spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/")

That way, the whole folder is read.

Of course, if your files have different structures, Spark obviously does not know what to do, so you will have to define a schema manually (see the sketch below).
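
For instance, a manual schema might look like this. A minimal sketch, where the column names and types are only placeholders for whatever the sales CSVs actually contain:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical columns -- replace with the real CSV headers and types
sales_schema = StructType([
    StructField("Order ID", StringType(), True),
    StructField("Product", StringType(), True),
    StructField("Price Each", DoubleType(), True),
])

df = (spark.read.format("csv")
      .option("header", "true")
      .schema(sales_schema)  # enforce these types instead of inferring them
      .load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"))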

Hi, thank you for your answer! Yeah, the structures of all my CSV files are the same.

I used the listdir() method to get all the file names, and with a for loop I read each path and CSV file and save it into a new DataFrame.

Important: Actually, if I write "dbfs:/..." it doesn't work (I always get an error like "file not found"), but when I use "/dbfs/" it works, I don't know why 😥

Anyway, this is the correct code to read all the CSV files and concatenate them:

import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

files = os.listdir(folder_path)  # returns a list of all file names in the folder

df_all_months = pd.DataFrame()  # empty DataFrame to collect all months

for file in files:
    df_of_single_file = pd.read_csv(folder_path + file)  # read the current file
    df_all_months = pd.concat([df_all_months, df_of_single_file])
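
As a side note on the pandas version: concatenating inside the loop copies the accumulated data on every pass; the usual idiom is to collect the frames in a list and call pd.concat once. A minimal sketch of that variant:

import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

# Read every file into a list, then concatenate once at the end
frames = [pd.read_csv(folder_path + name) for name in os.listdir(folder_path)]
df_all_months = pd.concat(frames, ignore_index=True)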

-werners-
Esteemed Contributor III

Hi,

This solution will indeed work, but it is far from optimal.

You can read a whole folder at once instead of using a loop.

The difference between dbfs:/ and /dbfs/ is just the type of file interface.

dbfs:/ is the URI scheme that Spark uses, while /dbfs/ is the local FUSE mount used by local file APIs such as os and pandas, which is why /dbfs/ works in your pandas code and dbfs:/ works with spark.read.
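
To make the distinction concrete, a minimal sketch contrasting the two interfaces on the same folder:

import os

rel_path = "FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

# Spark addresses DBFS through the dbfs:/ URI scheme
df = spark.read.format("csv").option("header", "true").load("dbfs:/" + rel_path)

# Local file APIs (os, pandas, open) go through the /dbfs/ FUSE mount
print(os.listdir("/dbfs/" + rel_path))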

Lakshay
Esteemed Contributor

It seems that all your CSV files are present under one folder, and since you are able to union them, all these files must have the same schema as well.

Given the above conditions, you can simply read all the data by referencing the folder instead of each file individually.
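
As a small variant, glob patterns are also accepted in the load path, so the read could be restricted to just the monthly files. A sketch assuming the file naming shown in the question:

# A glob pattern limits the folder read to matching files only
df_all = (spark.read.format("csv")
          .option("header", "true")
          .load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_*_2019.csv"))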

Hello, thank you for the answer! Yeah, that is true: the schema of all my CSV files is the same and they are all located in one folder. I posted a solution above your message.
