<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Read and transform CSVs in parallel. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32846#M23965</link>
    <description>&lt;P&gt;You can read all the CSV files from a path using a wildcard character, e.g. spark.read.csv("path/*.csv").&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, since the files are processed in parallel, you won't get your records in exactly the same order as they appear in your CSV files.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To achieve that, you need to sort your dataframe using a column that identifies which file each row came from.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you don't have such a column, then create one. You can first write a small script that adds a 'row_num' column to each of your CSV files, numbering the rows globally.&lt;/P&gt;&lt;P&gt;E.g., for file1 with 100 rows the values will be 0 to 99, then for the next file with 100 rows they'll be 100 to 199, and so on.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Then you can read all the files at once and order the result by the row_num column.&lt;/P&gt;</description>
    <pubDate>Thu, 01 Sep 2022 13:19:10 GMT</pubDate>
    <dc:creator>AmanSehgal</dc:creator>
    <dc:date>2022-09-01T13:19:10Z</dc:date>
    <item>
      <title>Read and transform CSVs in parallel.</title>
      <link>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32845#M23964</link>
      <description>&lt;P&gt;I need to read and transform several CSV files and then append them to a single data frame. I am able to do this in Databricks using simple for loops, but I would like to speed this up.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Below is the rough structure of my code: &lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;for filepath in all_filepaths:
    df1 = read_file(filepath)
    df2 = transform(df1)
    df3 = df3.append(df2)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Rather than processing one file at a time, is there a way to process them in parallel? There are plenty of solutions online, but I could only get the following to work in Databricks: &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;with ThreadPoolExecutor(max_workers=20) as pool:
    df3 = pd.concat(pool.map(read_and_transform_df, all_filepaths))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;For 153 files, the first approach took 3.35 mins and the second approach took 3.87 mins.&lt;/P&gt;&lt;P&gt;Is there a way to optimize the second approach, or an alternative faster approach?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks, &lt;/P&gt;&lt;P&gt;Tanjil &lt;/P&gt;</description>
      <pubDate>Thu, 01 Sep 2022 12:10:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32845#M23964</guid>
      <dc:creator>tanjil</dc:creator>
      <dc:date>2022-09-01T12:10:42Z</dc:date>
    </item>
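A minimal, self-contained sketch of the ThreadPoolExecutor pattern in the question, using only the standard library so it runs anywhere. The `read_and_transform` helper, the uppercasing "transformation", and the throwaway file names are stand-ins for the poster's real logic, not code from the thread.

```python
# Parallel read-and-transform over multiple CSV files with ThreadPoolExecutor.
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_and_transform(filepath):
    """Read one CSV and apply a per-row transformation (here: uppercase column 0)."""
    with open(filepath, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    return [[row[0].upper(), *row[1:]] for row in body]

def read_all_parallel(filepaths, max_workers=20):
    """Map read_and_transform over all files and concatenate the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = pool.map(read_and_transform, filepaths)  # preserves input order
    return [row for chunk in chunks for row in chunk]

# Demo with two throwaway CSV files.
tmp = Path(tempfile.mkdtemp())
for i in range(2):
    (tmp / f"file{i}.csv").write_text(f"name,value\nrow{i},{i}\n")
combined = read_all_parallel(sorted(tmp.glob("*.csv")))
print(combined)  # [['ROW0', '0'], ['ROW1', '1']]
```

Note that in CPython, threads only overlap well while workers are blocked on I/O; if the transform itself is CPU-heavy, a `ProcessPoolExecutor` or a Spark-native read (as suggested in the replies below) is usually the faster route, which may explain why the threaded pandas version was no quicker than the loop.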
    <item>
      <title>Re: Read and transform CSVs in parallel.</title>
      <link>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32846#M23965</link>
      <description>&lt;P&gt;You can read all the CSV files from a path using a wildcard character, e.g. spark.read.csv("path/*.csv").&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, since the files are processed in parallel, you won't get your records in exactly the same order as they appear in your CSV files.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To achieve that, you need to sort your dataframe using a column that identifies which file each row came from.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you don't have such a column, then create one. You can first write a small script that adds a 'row_num' column to each of your CSV files, numbering the rows globally.&lt;/P&gt;&lt;P&gt;E.g., for file1 with 100 rows the values will be 0 to 99, then for the next file with 100 rows they'll be 100 to 199, and so on.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Then you can read all the files at once and order the result by the row_num column.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Sep 2022 13:19:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32846#M23965</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-09-01T13:19:10Z</dc:date>
    </item>
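The row_num idea from the reply above can be sketched as a small standard-library script: rewrite each CSV with an extra row_num column whose values continue across files (the first file gets 0..n-1, the next file picks up where it left off). The `add_row_numbers` helper and file names are illustrative, not from the original post.

```python
# Prepend a global, monotonically increasing row_num column to each CSV file.
import csv
import tempfile
from pathlib import Path

def add_row_numbers(filepaths):
    """Rewrite each CSV in place with a leading row_num column, numbered globally."""
    next_num = 0
    for filepath in filepaths:
        with open(filepath, newline="") as f:
            rows = list(csv.reader(f))
        header, body = rows[0], rows[1:]
        numbered = [["row_num", *header]]
        for row in body:
            numbered.append([str(next_num), *row])
            next_num += 1  # carries over into the next file
        with open(filepath, "w", newline="") as f:
            csv.writer(f, lineterminator="\n").writerows(numbered)

# Demo: two small files; b.csv's rows continue numbering after a.csv's.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.csv").write_text("x\n10\n20\n")
(tmp / "b.csv").write_text("x\n30\n")
add_row_numbers([tmp / "a.csv", tmp / "b.csv"])
print((tmp / "b.csv").read_text())
```

After this preprocessing step, all files can be read in one wildcard `spark.read.csv(...)` call and the original order restored with `orderBy("row_num")`.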
    <item>
      <title>Re: Read and transform CSVs in parallel.</title>
      <link>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32847#M23966</link>
      <description>&lt;P&gt;Hi, &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for your reply. I have been able to read all the files in a single attempt using PySpark:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spk_df1 = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", ",")
    .load(all_filepaths.tolist()))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, I could not figure out how to perform any of the row-level transformations that are unique to each file. What would you suggest here? &lt;/P&gt;</description>
      <pubDate>Thu, 01 Sep 2022 14:12:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32847#M23966</guid>
      <dc:creator>tanjil</dc:creator>
      <dc:date>2022-09-01T14:12:34Z</dc:date>
    </item>
    <item>
      <title>Re: Read and transform CSVs in parallel.</title>
      <link>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32848#M23967</link>
      <description>&lt;P&gt;Can you filter rows for a particular file? If yes, then take a subset of the dataframe using a where clause (df.where()) and apply the respective transformations to each subset.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Sep 2022 17:12:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32848#M23967</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-09-01T17:12:17Z</dc:date>
    </item>
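The subset-and-transform pattern suggested above can be illustrated with plain Python lists (a Spark session isn't available here, but the idea is the same: filter rows by their source file, then apply that file's transformation). The file names and transform functions below are hypothetical.

```python
# One transformation per source file, applied to the matching subset of rows.
rows = [
    {"source_file": "a.csv", "value": 1},
    {"source_file": "b.csv", "value": 2},
    {"source_file": "a.csv", "value": 3},
]

# Keyed the same way df.where() would filter in Spark.
transforms = {
    "a.csv": lambda r: {**r, "value": r["value"] * 10},   # a.csv: scale by 10
    "b.csv": lambda r: {**r, "value": r["value"] + 100},  # b.csv: add an offset
}

transformed = [transforms[r["source_file"]](r) for r in rows]
print([r["value"] for r in transformed])  # [10, 102, 30]
```

In Spark the equivalent would be to tag each row with `F.input_file_name()` via `withColumn`, take one `df.where(...)` subset per file, transform each subset, and union the results back together.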
    <item>
      <title>Re: Read and transform CSVs in parallel.</title>
      <link>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32849#M23968</link>
      <description>&lt;P&gt;Please consider loading the CSV files with Auto Loader.&lt;/P&gt;&lt;P&gt;This way, you will speed it up &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/schema" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/schema&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I know it can be confusing because it is a stream, but add&lt;/P&gt;&lt;P&gt;.trigger(availableNow=True)&lt;/P&gt;&lt;P&gt;to process everything just once and finish after all files are loaded.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Sep 2022 20:07:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32849#M23968</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-09-01T20:07:45Z</dc:date>
    </item>
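A minimal sketch of the Auto Loader suggestion above. This is a configuration fragment that only runs on a Databricks cluster (the `cloudFiles` source is Databricks-specific, and `spark` is the runtime-provided session); the input, schema, checkpoint, and table names are placeholders.

```python
# Ingest all CSVs once with Auto Loader, then stop (availableNow trigger).
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/tmp/schema")      # where the inferred schema is stored
    .load("/path/to/csvs/")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoint")         # tracks which files were already ingested
    .trigger(availableNow=True)                               # process everything available, then finish
    .toTable("combined_csvs"))
```

The checkpoint also means a rerun only picks up files that arrived since the last run, which is what makes this approach scale well for incremental CSV loads.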
    <item>
      <title>Re: Read and transform CSVs in parallel.</title>
      <link>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32850#M23969</link>
      <description>&lt;P&gt;Great recommendation, @Hubert Dudek!&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 23:34:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32850#M23969</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-09-09T23:34:18Z</dc:date>
    </item>
    <item>
      <title>Re: Read and transform CSVs in parallel.</title>
      <link>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32851#M23970</link>
      <description>&lt;P&gt;Hi @tanjil,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 21 Sep 2022 04:56:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-and-transform-csvs-in-parallel/m-p/32851#M23970</guid>
      <dc:creator>Vidula</dc:creator>
      <dc:date>2022-09-21T04:56:48Z</dc:date>
    </item>
  </channel>
</rss>

