<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Best Way to process large number of records from multiple files in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82404#M36638</link>
    <description>&lt;P&gt;Are the JSON files compressed? If they are in .gz, each file is unsplittable, which means you lose some of Spark's parallel magic.&lt;/P&gt;</description>
    <pubDate>Thu, 08 Aug 2024 14:24:46 GMT</pubDate>
    <dc:creator>lprevost</dc:creator>
    <dc:date>2024-08-08T14:24:46Z</dc:date>
    <item>
      <title>Best Way to process large number of records from multiple files</title>
      <link>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82378#M36628</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have input files in S3 with the structure below:&lt;/P&gt;&lt;P&gt;/mnt/&amp;lt;mount_name&amp;gt;/test/&amp;lt;company_id&amp;gt;/sales/file_1.json&lt;/P&gt;&lt;P&gt;/mnt/&amp;lt;mount_name&amp;gt;/test/&amp;lt;company_id&amp;gt;/sales/file_2.json&lt;/P&gt;&lt;P&gt;/mnt/&amp;lt;mount_name&amp;gt;/test/&amp;lt;company_id&amp;gt;/sales/file_&amp;lt;n&amp;gt;.json&lt;/P&gt;&lt;P&gt;Number of companies = 15&lt;/P&gt;&lt;P&gt;Number of files per company = 30&lt;/P&gt;&lt;P&gt;Total files = 450&lt;/P&gt;&lt;P&gt;Each file contains nearly 180,000 records.&lt;/P&gt;&lt;P&gt;My question is: what is the best way to read the files and insert the records into a database table?&lt;/P&gt;&lt;P&gt;After reading the files I need to do the following operations:&lt;/P&gt;&lt;P&gt;1. Typecast the columns&lt;/P&gt;&lt;P&gt;2. Derive some columns from existing columns&lt;/P&gt;&lt;P&gt;3. Filter bad records&lt;/P&gt;&lt;P&gt;4. Join with the Item dataframe&lt;/P&gt;&lt;P&gt;5. Filter records that do not match the item data&lt;/P&gt;&lt;P&gt;6. Insert into the DB table&lt;/P&gt;&lt;P&gt;7. Write error records to one file (error file) and completed records to another file (completed file)&lt;/P&gt;&lt;P&gt;My approach:&lt;/P&gt;&lt;P&gt;1. I read all the files in multiple threads and write them to one location in Parquet format (if I write in Delta format it takes more time, and writing from multiple threads fails because the Delta table must be created before it can be written to) - this takes nearly 30 minutes.&lt;/P&gt;&lt;P&gt;2. Once all the files are written to that one location in Parquet format, I read them back and start processing the records (one dataframe with nearly 81,000,000 records) - this takes several hours.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 12:21:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82378#M36628</guid>
      <dc:creator>Policepatil</dc:creator>
      <dc:date>2024-08-08T12:21:51Z</dc:date>
    </item>
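    <!-- Editor's note: the transformation steps in the question (typecast, derive, filter bad records, join with items, split matched/unmatched) can be sketched per record in plain Python. This is only an illustration of the logic under assumed field names (amount, qty, item_id are hypothetical); the real job would express the same steps as Spark DataFrame operations on the full 81M-record dataframe.

    ```python
    # Sketch of the row-level pipeline from the question (steps 1-7).
    # Field names (amount, qty, item_id) are assumed for illustration.
    def process(records, items):
        good, bad, unmatched = [], [], []
        for rec in records:
            try:
                # 1. Typecast the columns
                rec["amount"] = float(rec["amount"])
                rec["qty"] = int(rec["qty"])
            except (KeyError, TypeError, ValueError):
                bad.append(rec)           # 3. Filter bad records
                continue
            # 2. Derive a column from existing columns
            rec["unit_price"] = rec["amount"] / rec["qty"]
            # 4./5. Join with item data; drop records with no match
            item = items.get(rec.get("item_id"))
            if item is None:
                unmatched.append(rec)
                continue
            rec.update(item)
            good.append(rec)              # 6. would be inserted into the DB table
        return good, bad, unmatched       # 7. bad/unmatched go to the error file
    ```

    In Spark, each branch would instead be a filtered DataFrame written to its own sink, so the data never funnels through a single Python loop. -->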
    <item>
      <title>Re: Best Way to process large number of records from multiple files</title>
      <link>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82384#M36631</link>
      <description>&lt;P&gt;Compute: Multinode cluster&lt;/P&gt;&lt;P&gt;Driver Type: i3.xlarge - 30.5GB, 4 cores&lt;/P&gt;&lt;P&gt;Worker Type: i3.xlarge -&amp;nbsp;30.5GB, 4 cores&lt;/P&gt;&lt;P&gt;Total number of workers:&amp;nbsp; 4&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 12:28:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82384#M36631</guid>
      <dc:creator>Policepatil</dc:creator>
      <dc:date>2024-08-08T12:28:04Z</dc:date>
    </item>
    <item>
      <title>Re: Best Way to process large number of records from multiple files</title>
      <link>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82404#M36638</link>
      <description>&lt;P&gt;Are the JSON files compressed? If they are in .gz, each file is unsplittable, which means you lose some of Spark's parallel magic.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 14:24:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82404#M36638</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-08-08T14:24:46Z</dc:date>
    </item>
    <item>
      <title>Re: Best Way to process large number of records from multiple files</title>
      <link>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82551#M36677</link>
      <description>&lt;P&gt;No, the files are not compressed.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2024 12:29:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82551#M36677</guid>
      <dc:creator>Policepatil</dc:creator>
      <dc:date>2024-08-09T12:29:42Z</dc:date>
    </item>
    <item>
      <title>Re: Best Way to process large number of records from multiple files</title>
      <link>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82563#M36684</link>
      <description>&lt;P&gt;I’m on my learning curve too, but here are a few thoughts for you:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Use Auto Loader to read them all in one pass, or company folder by company folder in multiple passes. It doesn’t look like much data; I’m doing something similar with CSV files where each partition has 2B records, and it takes about 15-20 minutes. Use glob patterns in the read statement to target the folders you want.&lt;/LI&gt;&lt;LI&gt;Don’t do too much transformation in your first stage. Use the medallion architecture to create a bronze table that just holds your raw data, plus perhaps the directory paths via _metadata. Put all of it in Databricks' default format, which is Delta. I don’t think your lag is due to Delta vs. Parquet; I suspect it’s due to either a lack of partitioning or big variations in file size that skew your jobs. You could read the files, repartition, then write.&lt;/LI&gt;&lt;LI&gt;Do a second stage to create your silver and gold tables for your items 2-5. That stage could include a partition or cluster column such as company, and it is also where you do the join. I bet you’ll get much better performance doing your item join against the bronze Delta table than against the raw JSON.&lt;/LI&gt;&lt;LI&gt;On item 2 of your approach: it shouldn’t take that long, and I wonder how you are doing the transforms and joins. That’s a big topic, too broad to give guidance on here. But I see that as a subsequent step; the first is to create a single raw bronze table.&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Fri, 09 Aug 2024 13:19:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/82563#M36684</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-08-09T13:19:25Z</dc:date>
    </item>
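    <!-- Editor's note: the Auto Loader / medallion suggestion above might look roughly like the sketch below on Databricks. It is a hedged sketch, not a definitive implementation: the mount path, schema/checkpoint locations, and table name are placeholders, and it only runs inside a Databricks runtime where `spark` is provided.

    ```python
    # Sketch of the suggested bronze ingestion with Auto Loader.
    # Paths and table names are placeholders; requires a Databricks runtime.
    bronze = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/mount_name/_schemas/sales")
        .load("/mnt/mount_name/test/*/sales/*.json")            # glob across companies
        .selectExpr("*", "_metadata.file_path AS source_file")  # keep the source path
    )
    (bronze.writeStream
        .option("checkpointLocation", "/mnt/mount_name/_checkpoints/sales_bronze")
        .trigger(availableNow=True)   # one-shot, batch-style run
        .toTable("bronze_sales"))     # default Delta table
    ```

    The `availableNow` trigger processes everything currently in the folders and then stops, so this behaves like the original batch job while keeping Auto Loader's incremental file tracking. -->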
    <item>
      <title>Re: Best Way to process large number of records from multiple files</title>
      <link>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/83705#M36991</link>
      <description>&lt;P&gt;Thanks for your response, I will check and let you know.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Aug 2024 05:35:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-way-to-process-large-number-of-records-from-multiple-files/m-p/83705#M36991</guid>
      <dc:creator>Policepatil</dc:creator>
      <dc:date>2024-08-21T05:35:58Z</dc:date>
    </item>
  </channel>
</rss>

