<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Migrating old solution to new optimal delta lake setup in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/migrating-old-solution-to-new-optimal-delta-lake-setup/m-p/35862#M25981</link>
    <description>&lt;P&gt;Hi Databricks community!&lt;/P&gt;&lt;P&gt;I have previously worked on a project that could easily be optimized with Databricks. It is currently running on Azure Synapse, but the premise is the same.&lt;/P&gt;&lt;P&gt;I'll describe the scenario here:&lt;/P&gt;&lt;P&gt;1. Data owners send a constant flow of JSON files into a data lake location (a Gen2 Blob Storage container). Each file contains a single business record and is between 50 and 500 KB (very small).&lt;/P&gt;&lt;P&gt;2. As these files land on the data lake, they are in the same process partitioned down to the minute level (year, month, day, hour, minute), which is heavily over-partitioned.&lt;/P&gt;&lt;P&gt;3. The incremental loading of the incoming files works completely fine: they are written as Parquet files (not Delta) on the lake and then processed into a SQL table.&lt;/P&gt;&lt;P&gt;4. This means that a full load is currently impossible, because there are millions of files and thousands of directories to traverse.&lt;/P&gt;&lt;P&gt;So essentially my question is: what would be the best way to move away from this old solution and into a full lakehouse setup? Is there an optimal way to pre-process millions of small JSON files in a one-off load to store them in Delta in a sort of landing layer?&lt;/P&gt;&lt;P&gt;I'll take any suggestions with open arms!&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Wed, 28 Jun 2023 21:08:44 GMT</pubDate>
    <dc:creator>JohanBringsdal</dc:creator>
    <dc:date>2023-06-28T21:08:44Z</dc:date>
    <item>
      <title>Migrating old solution to new optimal delta lake setup</title>
      <link>https://community.databricks.com/t5/data-engineering/migrating-old-solution-to-new-optimal-delta-lake-setup/m-p/35862#M25981</link>
      <description>&lt;P&gt;Hi Databricks community!&lt;/P&gt;&lt;P&gt;I have previously worked on a project that could easily be optimized with Databricks. It is currently running on Azure Synapse, but the premise is the same.&lt;/P&gt;&lt;P&gt;I'll describe the scenario here:&lt;/P&gt;&lt;P&gt;1. Data owners send a constant flow of JSON files into a data lake location (a Gen2 Blob Storage container). Each file contains a single business record and is between 50 and 500 KB (very small).&lt;/P&gt;&lt;P&gt;2. As these files land on the data lake, they are in the same process partitioned down to the minute level (year, month, day, hour, minute), which is heavily over-partitioned.&lt;/P&gt;&lt;P&gt;3. The incremental loading of the incoming files works completely fine: they are written as Parquet files (not Delta) on the lake and then processed into a SQL table.&lt;/P&gt;&lt;P&gt;4. This means that a full load is currently impossible, because there are millions of files and thousands of directories to traverse.&lt;/P&gt;&lt;P&gt;So essentially my question is: what would be the best way to move away from this old solution and into a full lakehouse setup? Is there an optimal way to pre-process millions of small JSON files in a one-off load to store them in Delta in a sort of landing layer?&lt;/P&gt;&lt;P&gt;I'll take any suggestions with open arms!&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 21:08:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/migrating-old-solution-to-new-optimal-delta-lake-setup/m-p/35862#M25981</guid>
      <dc:creator>JohanBringsdal</dc:creator>
      <dc:date>2023-06-28T21:08:44Z</dc:date>
    </item>
  </channel>
</rss>

