06-06-2023 04:03 AM
Hello everybody,
I am absolutely new to Databricks, so I need your help.
Details:
Task: merge 12 CSV files in Databricks in the best possible way.
Location of files: I will describe it in detail, because I can't navigate Databricks well yet. If I go to Data -> Browse DBFS, I can find the folder with my 12 CSV files.
What I have tried:
First I need to say that I did reach the correct result, but I think it was a really bad approach.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_April = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_April_2019.csv")
df_August = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_August_2019.csv")
df_December = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_December_2019.csv")
df_February = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_February_2019.csv")
df_January = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_January_2019.csv")
df_July = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_July_2019.csv")
df_June = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_June_2019.csv")
df_March = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_March_2019.csv")
df_May = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_May_2019.csv")
df_November = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_November_2019.csv")
df_October = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_October_2019.csv")
df_September = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_September_2019.csv")
df_AllMonth = df_April.union(df_August).union(df_December).union(df_February).union(df_January).union(df_July).union(df_June).union(df_March).union(df_May).union(df_November).union(df_October).union(df_September)
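I guess the twelve unions could at least be collapsed with functools.reduce. A sketch of the same reads, just folded together (the month list simply mirrors my file names):
from functools import reduce

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
paths = [f"dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_{m}_2019.csv"
         for m in months]
# Read each file, then fold all the DataFrames together pairwise.
dfs = [spark.read.format("csv").option("delimiter", ",").option("header", "true").load(p)
       for p in paths]
df_AllMonth = reduce(lambda a, b: a.union(b), dfs)
But that still reads each file into its own DataFrame first.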
Conclusion:
I want to find an approach where I can merge the data without saving each file into its own variable. Is that possible? Maybe you can find a better way to do this task?
Thank you!
06-07-2023 12:51 AM
Ok some tips:
Spark can read whole folders at once. If you have 12 CSV files in one folder, AND THEY HAVE THE SAME SCHEMA, you can try: spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/")
That way the whole folder is read in one go.
Of course, if your files have different structures, Spark does not know what to do, so you will have to define a schema manually.
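A manual schema would look something like this (a minimal sketch; the column names are placeholders, swap in the actual columns of your files):
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder columns, replace them with the real ones from your CSVs.
schema = StructType([
    StructField("Order ID", StringType(), True),
    StructField("Product", StringType(), True),
    # add the remaining columns here
])

df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)
      .load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"))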
06-07-2023 11:33 PM
Hi, thank you for your answer! Yes, the structure of all my CSV files is the same.
I used the listdir() method to get the names of all the files, and with a for loop I read each path and CSV file and save it into a new DataFrame.
Important: if I write "dbfs:/...." it doesn't work (I always get an error saying the file isn't found), but when I use "/dbfs/" it works, I don't know why 😥
Anyway, this is the code that correctly reads all the CSV files and concatenates them:
import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"
files = os.listdir(folder_path)  # returns the names of all files in the folder
df_all_months = pd.DataFrame()  # create a new, empty DataFrame
for file in files:
    df_of_single_file = pd.read_csv(folder_path + file)  # read the current file
    df_all_months = pd.concat([df_all_months, df_of_single_file])
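A slightly cleaner variant collects the frames in a list and concatenates once at the end, so the already-merged data isn't copied on every iteration (a sketch of the same idea):
import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"
# Read every CSV into a list first, then concatenate a single time.
frames = [pd.read_csv(folder_path + f) for f in os.listdir(folder_path) if f.endswith(".csv")]
df_all_months = pd.concat(frames, ignore_index=True)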
06-08-2023 12:02 AM
Hi,
this solution will indeed work, but it is far from optimal.
You can read a whole folder at once instead of using a loop.
The difference between dbfs:/ and /dbfs/ is just the type of file interface.
dbfs:/ is the URI scheme that Spark APIs use, while /dbfs/ is the local FUSE mount that non-Spark tools such as os.listdir() and pandas go through. That is why /dbfs/ works in your pandas code and dbfs:/ does not.
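Side by side, with the paths from this thread, the two interfaces look like this:
import os
import pandas as pd

# Spark APIs take the dbfs:/ URI scheme:
df = spark.read.csv("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/", header=True)

# Local file APIs (os, pandas, open) go through the /dbfs/ FUSE mount instead:
folder = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"
pdf = pd.read_csv(folder + os.listdir(folder)[0])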
06-07-2023 06:44 AM
It seems that all your CSV files are in one folder, and since you are able to union them, they must all have the same schema as well.
Given those two conditions, you can simply read all the data by pointing at the folder instead of referring to each file individually.
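For example (a sketch; the glob pattern limits the read to CSV files in case the folder ever contains anything else):
df_all_months = (spark.read
                 .format("csv")
                 .option("header", "true")
                 .load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/*.csv"))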
06-07-2023 11:35 PM
Hello, thank you for your answer! Yes, that is true: the schema of all my CSV files is the same and they are all located in one folder. I posted a solution above your message.