<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Stream processing a large number of JSON files and handling exceptions in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/stream-processing-large-number-of-json-files-and-handling/m-p/112981#M9220</link>
    <description>&lt;P&gt;We have customers that read millions of files per hour using Databricks Auto Loader. For high-volume use cases, we recommend enabling &lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode" target="_self"&gt;file notification mode&lt;/A&gt;, which, instead of continuously performing list operations on the filesystem, uses cloud-native offerings to "push" new file locations to Auto Loader.&lt;/P&gt;
&lt;P&gt;For handling bad records, we have the &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/query/formats/json#rescued-data-column" target="_self"&gt;&lt;SPAN&gt;rescuedDataColumn&lt;/SPAN&gt;&lt;/A&gt;, and for JSON files in particular, &lt;SPAN&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/from_json#:~:text=meets%20corrupted%20records.-,columnNameOfCorruptRecord,-(default%20is%20the" target="_self"&gt;columnNameOfCorruptRecord&lt;/A&gt;&lt;/SPAN&gt;. You can use these in tandem with some additional filtering logic to quarantine bad records.&lt;/P&gt;</description>
    <pubDate>Tue, 18 Mar 2025 21:55:50 GMT</pubDate>
    <dc:creator>cgrant</dc:creator>
    <dc:date>2025-03-18T21:55:50Z</dc:date>
    <item>
      <title>Stream processing a large number of JSON files and handling exceptions</title>
      <link>https://community.databricks.com/t5/get-started-discussions/stream-processing-large-number-of-json-files-and-handling/m-p/112727#M9219</link>
      <description>&lt;OL&gt;&lt;LI&gt;Our application writes many small JSON files, and the expected volume is high (estimate: 1 million files in an hourly window during peak season). Per the current design, these files are streamed through Spark Structured Streaming, and we use Auto Loader to load them.&lt;OL&gt;&lt;LI&gt;Can the Databricks bronze job handle this volume without failures when loading to the Bronze table?&lt;/LI&gt;&lt;LI&gt;From reading several forums, it appears that streaming a large Parquet file is not ideal.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;Handling stream-processing failures through a quarantine process&lt;OL&gt;&lt;LI&gt;When an exception occurs, we would like to write those records out for operations review. Are there any best practices or reference materials, with examples, for handling this?&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Sun, 16 Mar 2025 13:42:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/stream-processing-large-number-of-json-files-and-handling/m-p/112727#M9219</guid>
      <dc:creator>VijayP</dc:creator>
      <dc:date>2025-03-16T13:42:05Z</dc:date>
    </item>
    <item>
      <title>Re: Stream processing a large number of JSON files and handling exceptions</title>
      <link>https://community.databricks.com/t5/get-started-discussions/stream-processing-large-number-of-json-files-and-handling/m-p/112981#M9220</link>
      <description>&lt;P&gt;We have customers that read millions of files per hour using Databricks Auto Loader. For high-volume use cases, we recommend enabling &lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode" target="_self"&gt;file notification mode&lt;/A&gt;, which, instead of continuously performing list operations on the filesystem, uses cloud-native offerings to "push" new file locations to Auto Loader.&lt;/P&gt;
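&lt;P&gt;As a minimal sketch, enabling notification mode is a single Auto Loader option. The bucket paths, schema location, and table name below are illustrative placeholders, not values from this thread:&lt;/P&gt;
&lt;PRE&gt;# Sketch: Auto Loader reading JSON with file notification mode (PySpark).
# All paths and the target table are assumed placeholders.
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  # Push-based file discovery via cloud notifications instead of directory listing
  .option("cloudFiles.useNotifications", "true")
  # Auto Loader persists the inferred schema at this location
  .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
  .load("s3://my-bucket/raw/events/")
  .writeStream
  .option("checkpointLocation", "s3://my-bucket/_checkpoints/bronze_events")
  .toTable("bronze.events"))&lt;/PRE&gt;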
&lt;P&gt;For handling bad records, we have the &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/query/formats/json#rescued-data-column" target="_self"&gt;&lt;SPAN&gt;rescuedDataColumn&lt;/SPAN&gt;&lt;/A&gt;, and for JSON files in particular, &lt;SPAN&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/from_json#:~:text=meets%20corrupted%20records.-,columnNameOfCorruptRecord,-(default%20is%20the" target="_self"&gt;columnNameOfCorruptRecord&lt;/A&gt;&lt;/SPAN&gt;. You can use these in tandem with some additional filtering logic to quarantine bad records.&lt;/P&gt;
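&lt;P&gt;Here is a sketch of that quarantine split on the rescued-data column (again, the schema location, paths, and table names are placeholder assumptions):&lt;/P&gt;
&lt;PRE&gt;# Sketch: route records whose fields did not match the schema to a quarantine table.
from pyspark.sql import functions as F

raw = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
  # Non-conforming fields are captured in this column ("_rescued_data" is the default name)
  .option("rescuedDataColumn", "_rescued_data")
  .load("s3://my-bucket/raw/events/"))

# Records that fully matched the schema go to the Bronze table
clean = raw.filter(F.col("_rescued_data").isNull()).drop("_rescued_data")
# Records with rescued (mismatched or unparseable) data are kept for operations review
quarantined = raw.filter(F.col("_rescued_data").isNotNull())

(clean.writeStream
  .option("checkpointLocation", "s3://my-bucket/_checkpoints/events_clean")
  .toTable("bronze.events"))
(quarantined.writeStream
  .option("checkpointLocation", "s3://my-bucket/_checkpoints/events_quarantine")
  .toTable("bronze.events_quarantine"))&lt;/PRE&gt;</description>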
      <pubDate>Tue, 18 Mar 2025 21:55:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/stream-processing-large-number-of-json-files-and-handling/m-p/112981#M9220</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2025-03-18T21:55:50Z</dc:date>
    </item>
  </channel>
</rss>

