<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Network bottleneck in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/93395#M38693</link>
    <description>&lt;P&gt;Thank you.&lt;/P&gt;</description>
    <pubDate>Thu, 10 Oct 2024 06:54:27 GMT</pubDate>
    <dc:creator>ZoeCole</dc:creator>
    <dc:date>2024-10-10T06:54:27Z</dc:date>
    <item>
      <title>Network bottleneck</title>
      <link>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/90712#M37981</link>
      <description>&lt;P&gt;Within a script, I noticed that the network connection between the driver and the mounted network drives is often a huge bottleneck. The network throughput seems unreasonably low for Azure, given this cluster:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Single node: Standard_DS12_v2 · DBR: 14.3.x-photon-scala2.12&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Is there a way to speed up writing a result to Azure Blob Storage? My current code looks like this:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;joined_df.write.partitionBy("IdStation").mode("overwrite").parquet("/mnt/temp_folder")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;The CPU's IO wait in particular looks abnormal.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 17 Sep 2024 11:02:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/90712#M37981</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-09-17T11:02:14Z</dc:date>
    </item>
    <item>
      <title>Re: Network bottleneck</title>
      <link>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/90715#M37982</link>
      <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="cpu.JPG" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11283iFAD2ECFB092B3E3D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="cpu.JPG" alt="cpu.JPG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt; &lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="network.JPG" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11284iEF1E736D054B2C81/image-size/medium?v=v2&amp;amp;px=400" role="button" title="network.JPG" alt="network.JPG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Here you can see the very slow network traffic, which causes IO wait on the CPU.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Sep 2024 11:03:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/90715#M37982</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-09-17T11:03:23Z</dc:date>
    </item>
    <item>
      <title>Re: Network bottleneck</title>
      <link>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/90767#M37991</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102237"&gt;@jenshumrich&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The write is partitioned by&amp;nbsp;IdStation. How many partitions does that create? Could the problem be too many small files?&lt;BR /&gt;As a rule of thumb, each partition should hold around 1 GB and each file should be around 128 MB.&lt;BR /&gt;&lt;BR /&gt;I see a lot of IO wait, which is consistent with my suspicion that too many files are being created.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Sep 2024 17:20:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/90767#M37991</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-17T17:20:08Z</dc:date>
    </item>
    <item>
      <title>Re: Network bottleneck</title>
      <link>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/90868#M38013</link>
      <description>&lt;P&gt;You are right. I am creating 200 small files of roughly 6 MB each (in the quality system) and a few hundred thousand files in production. The partitioning is driven by the original business need and further processing. Let me test with a different partitioning.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Sep 2024 11:04:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/90868#M38013</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-09-18T11:04:55Z</dc:date>
    </item>
    <item>
      <title>Re: Network bottleneck</title>
      <link>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/93395#M38693</link>
      <description>&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Oct 2024 06:54:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/network-bottleneck/m-p/93395#M38693</guid>
      <dc:creator>ZoeCole</dc:creator>
      <dc:date>2024-10-10T06:54:27Z</dc:date>
    </item>
  </channel>
</rss>

