<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Merge 12 CSV files in Databricks. in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3554#M149</link>
    <description>&lt;P&gt;Hi, thank you for your answer! Yes, all of my CSV files have the same structure.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I used os.listdir() to get the names of all the files, and in a for loop I build each path, read the CSV file, and append it to a new DataFrame.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Important:&lt;/B&gt; If I write "dbfs:/...." it doesn't work (I always get a file-not-found error), but when I use "/dbfs/" it works, and I don't know why&lt;span class="lia-unicode-emoji" title=":sad_but_relieved_face:"&gt;😥&lt;/span&gt; &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Anyway, this is the code that reads all the CSV files and concatenates them.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

files = os.listdir(folder_path)  # names of all entries in the folder (here, the CSV files)

df_all_months = pd.DataFrame()  # start with an empty DataFrame

for file in files:
    df_of_single_file = pd.read_csv(folder_path + file)  # read the current file
    df_all_months = pd.concat([df_all_months, df_of_single_file])  # append it to the result&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 08 Jun 2023 06:33:05 GMT</pubDate>
    <dc:creator>AleksandraFrolo</dc:creator>
    <dc:date>2023-06-08T06:33:05Z</dc:date>
    <item>
      <title>Merge 12 CSV files in Databricks.</title>
      <link>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3551#M146</link>
      <description>&lt;P&gt;Hello everybody,&lt;/P&gt;&lt;P&gt;I am completely new to Databricks, so I need your help.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;Details:&lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Task:&lt;/B&gt; merge 12 CSV files in Databricks in the best way.&lt;/P&gt;&lt;P&gt;&lt;B&gt;Location of files:&lt;/B&gt; I will describe it in detail, because I cannot find my way around very well yet. If I go to Data -&amp;gt; Browse DBFS, I can find the folder with my 12 CSV files.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;What I have tried:&lt;/U&gt;&lt;/P&gt;&lt;P&gt;First, I should say that I got the correct result, but I think the approach was really bad.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Create a Spark session object. It is used to read data from the CSV files.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Read each CSV file into its own variable.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;&lt;CODE&gt;df_April = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_April_2019.csv")
df_August = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_August_2019.csv")
df_December = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_December_2019.csv")
df_February = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_February_2019.csv")
df_January = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_January_2019.csv")
df_July = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_July_2019.csv")
df_June = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_June_2019.csv")
df_March = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_March_2019.csv")
df_May = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_May_2019.csv")
df_November = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_November_2019.csv")
df_October = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_October_2019.csv")
df_September = spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/Sales_September_2019.csv")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use the union() method to combine the data. All DataFrames must have the same structure for this to work. union() returns a new DataFrame that contains all rows from the original DataFrame and all rows from the specified one; columns are matched by position and duplicate rows are kept.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;&lt;CODE&gt;df_AllMonth = df_April.union(df_August).union(df_December).union(df_February).union(df_January).union(df_July).union(df_June).union(df_March).union(df_May).union(df_November).union(df_October).union(df_September)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;Conclusion:&lt;/U&gt;&lt;/P&gt;&lt;P&gt;I want to find an approach where I can merge the data without saving each file into its own variable. Is that possible? Maybe you know a better way to do this task?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 06 Jun 2023 11:03:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3551#M146</guid>
      <dc:creator>AleksandraFrolo</dc:creator>
      <dc:date>2023-06-06T11:03:15Z</dc:date>
    </item>
    <item>
      <title>Re: Merge 12 CSV files in Databricks.</title>
      <link>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3552#M147</link>
      <description>&lt;P&gt;OK, some tips:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;You do not have to create a Spark session on Databricks; Databricks has already created one for you. Your getOrCreate() call does not break anything, though.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Spark can read whole folders at once. If you have 12 CSV files in one folder, AND THEY HAVE THE SAME SCHEMA, you can try: spark.read.format("csv").option("delimiter", ",").option("header","true").load("dbfs:/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/")&lt;/P&gt;&lt;P&gt;That way the whole folder is read.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Of course, if your files have different structures, Spark cannot tell how to combine them, so you will have to define a schema manually.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Jun 2023 07:51:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3552#M147</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-07T07:51:03Z</dc:date>
    </item>
    <item>
      <title>Re: Merge 12 CSV files in Databricks.</title>
      <link>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3553#M148</link>
      <description>&lt;P&gt;It seems that all your CSV files are in one folder, and since you are able to union them, they must all have the same schema as well.&lt;/P&gt;&lt;P&gt;Given these conditions, you can simply read all the data by pointing to the folder instead of referring to each file individually. &lt;/P&gt;</description>
      <pubDate>Wed, 07 Jun 2023 13:44:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3553#M148</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2023-06-07T13:44:30Z</dc:date>
    </item>
    <item>
      <title>Re: Merge 12 CSV files in Databricks.</title>
      <link>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3554#M149</link>
      <description>&lt;P&gt;Hi, thank you for your answer! Yes, all of my CSV files have the same structure.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I used os.listdir() to get the names of all the files, and in a for loop I build each path, read the CSV file, and append it to a new DataFrame.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Important:&lt;/B&gt; If I write "dbfs:/...." it doesn't work (I always get a file-not-found error), but when I use "/dbfs/" it works, and I don't know why&lt;span class="lia-unicode-emoji" title=":sad_but_relieved_face:"&gt;😥&lt;/span&gt; &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Anyway, this is the code that reads all the CSV files and concatenates them.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import os
import pandas as pd

folder_path = "/dbfs/FileStore/aleksandra.frolova@zebra.com/Sales Analysis/"

files = os.listdir(folder_path)  # names of all entries in the folder (here, the CSV files)

df_all_months = pd.DataFrame()  # start with an empty DataFrame

for file in files:
    df_of_single_file = pd.read_csv(folder_path + file)  # read the current file
    df_all_months = pd.concat([df_all_months, df_of_single_file])  # append it to the result&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 08 Jun 2023 06:33:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3554#M149</guid>
      <dc:creator>AleksandraFrolo</dc:creator>
      <dc:date>2023-06-08T06:33:05Z</dc:date>
    </item>
    <item>
      <title>Re: Merge 12 CSV files in Databricks.</title>
      <link>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3555#M150</link>
      <description>&lt;P&gt;Hello, thank you for your answer! Yes, that is true: all my CSV files have the same schema and they are all located in one folder. I posted a solution above your message.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Jun 2023 06:35:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3555#M150</guid>
      <dc:creator>AleksandraFrolo</dc:creator>
      <dc:date>2023-06-08T06:35:08Z</dc:date>
    </item>
    <item>
      <title>Re: Merge 12 CSV files in Databricks.</title>
      <link>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3556#M151</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;this solution will indeed work, but it is far from optimal.&lt;/P&gt;&lt;P&gt;You can read a whole folder at once instead of using a loop.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The difference between dbfs:/ and /dbfs/ is just the type of file interface.&lt;/P&gt;&lt;P&gt;dbfs:/ is the URI form used by Spark, while /dbfs/ is the local mount used by ordinary Python file APIs such as os and pandas, which is why /dbfs/ works in your pandas code.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Jun 2023 07:02:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/merge-12-csv-files-in-databricks/m-p/3556#M151</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-08T07:02:53Z</dc:date>
    </item>
  </channel>
</rss>

