<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Performance issue writing an extract of a huge unpartitionned single column dataframe in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-issue-writing-an-extract-of-a-huge-unpartitionned/m-p/106148#M42403</link>
    <description>&lt;P&gt;I have a huge df (40 billions rows) shared by delta share that has only one column 'payload' which contains json and that is not partitionned:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="leymariv_2-1737155764713.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14207i23C4D3AC3511380E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="leymariv_2-1737155764713.png" alt="leymariv_2-1737155764713.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Even if all those payloads are not the same, they have a common col sessionId that i need to extract to be able to write my bronze table and partition using this sessionId.&lt;/P&gt;&lt;P&gt;I am trying to either use limit(10) or&amp;nbsp;&lt;SPAN&gt;sample&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;fraction&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;0.00000001&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;seed&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;42&lt;/SPAN&gt;&lt;SPAN&gt;)&amp;nbsp;&lt;/SPAN&gt;on the df to be able to treat a small part of it just to ensure my code is running but my write method has always the same number of tasks and would take too long to end.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="leymariv_0-1737155486874.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14205i3A30E2488F198828/image-size/medium?v=v2&amp;amp;px=400" role="button" title="leymariv_0-1737155486874.png" alt="leymariv_0-1737155486874.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;What would you suggest to write and exploded small subset this huge unpartitionned json column based df ?&lt;/P&gt;</description>
    <pubDate>Fri, 17 Jan 2025 23:23:39 GMT</pubDate>
    <dc:creator>leymariv</dc:creator>
    <dc:date>2025-01-17T23:23:39Z</dc:date>
    <item>
      <title>Performance issue writing an extract of a huge unpartitionned single column dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-writing-an-extract-of-a-huge-unpartitionned/m-p/106148#M42403</link>
      <description>&lt;P&gt;I have a huge df (40 billions rows) shared by delta share that has only one column 'payload' which contains json and that is not partitionned:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="leymariv_2-1737155764713.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14207i23C4D3AC3511380E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="leymariv_2-1737155764713.png" alt="leymariv_2-1737155764713.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Even if all those payloads are not the same, they have a common col sessionId that i need to extract to be able to write my bronze table and partition using this sessionId.&lt;/P&gt;&lt;P&gt;I am trying to either use limit(10) or&amp;nbsp;&lt;SPAN&gt;sample&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;fraction&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;0.00000001&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;seed&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;42&lt;/SPAN&gt;&lt;SPAN&gt;)&amp;nbsp;&lt;/SPAN&gt;on the df to be able to treat a small part of it just to ensure my code is running but my write method has always the same number of tasks and would take too long to end.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="leymariv_0-1737155486874.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14205i3A30E2488F198828/image-size/medium?v=v2&amp;amp;px=400" role="button" title="leymariv_0-1737155486874.png" alt="leymariv_0-1737155486874.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;What would you suggest to write and exploded small subset this huge unpartitionned json column based df ?&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jan 2025 23:23:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-writing-an-extract-of-a-huge-unpartitionned/m-p/106148#M42403</guid>
      <dc:creator>leymariv</dc:creator>
      <dc:date>2025-01-17T23:23:39Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue writing an extract of a huge unpartitionned single column dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-writing-an-extract-of-a-huge-unpartitionned/m-p/106187#M42417</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/144344"&gt;@leymariv&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;You can check the schema of data in delta sharing table, using df.printSchema to better understand the JSON structure. Use&amp;nbsp;&lt;EM&gt;&lt;STRONG&gt;from_json&lt;/STRONG&gt;&lt;/EM&gt; function to flatten or normalize the data to respective columns.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Additionally, you can understand how data is being loaded into the table by using the&amp;nbsp;&lt;/SPAN&gt;DESCRIBE HISTORY&lt;SPAN&gt;&amp;nbsp;command. Look for append or merge conditions in the&amp;nbsp;&lt;/SPAN&gt;operation&lt;SPAN&gt;&amp;nbsp;column and refer to the&amp;nbsp;&lt;/SPAN&gt;operationMetrics&lt;SPAN&gt;&amp;nbsp;column for data metrics.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_0-1737209659726.png" style="width: 875px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14217i7DE89A58D3B07D12/image-dimensions/875x57?v=v2" width="875" height="57" role="button" title="hariprasad_0-1737209659726.png" alt="hariprasad_0-1737209659726.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;If you notice that data is being loaded incrementally (append or merge) into the Delta Sharing table, you can read the data version by version or timestamp by timestamp using below code.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_1-1737209712122.png" style="width: 1182px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14218iC9E3A89F705D84B2/image-dimensions/1182x263?v=v2" width="1182" height="263" role="button" title="hariprasad_1-1737209712122.png" alt="hariprasad_1-1737209712122.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Alternatively, you can specify a range for the timestamp or version to further narrow down the data read.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_2-1737209739138.png" style="width: 890px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14219i401AD4112DC8BEC8/image-dimensions/890x394?v=v2" width="890" height="394" role="button" title="hariprasad_2-1737209739138.png" alt="hariprasad_2-1737209739138.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;Further, you can leverage &lt;EM&gt;&lt;STRONG&gt;Spark Structured Streaming&lt;/STRONG&gt;&lt;/EM&gt; to read data from delta sharing table.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_4-1737209788947.png" style="width: 828px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14221iC0C6A9E094353F34/image-dimensions/828x98?v=v2" width="828" height="98" role="button" title="hariprasad_4-1737209788947.png" alt="hariprasad_4-1737209788947.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;Hari Prasad&lt;/P&gt;</description>
      <pubDate>Sat, 18 Jan 2025 14:16:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-writing-an-extract-of-a-huge-unpartitionned/m-p/106187#M42417</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2025-01-18T14:16:52Z</dc:date>
    </item>
  </channel>
</rss>

