<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage). Same job is taking 8 minutes to write to /dbfs/FileStore.

I would like to understand why write performance is different in both cases. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22375#M15314</link>
    <description>&lt;P&gt;It is about routing. When traffic stays on the local network inside the region, it will be very fast. Even traffic that leaves the local network but stays in the same area is still really fast. However, it will be much slower when the data has to go to another region, especially when it travels over the public internet.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Something like 13x is what I would expect. Additionally, traffic routed outside the local network will generate outbound traffic charges.&lt;/P&gt;</description>
    <pubDate>Mon, 25 Apr 2022 17:49:29 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2022-04-25T17:49:29Z</dc:date>
    <item>
      <title>Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage). Same job is taking 8 minutes to write to /dbfs/FileStore.

I would like to understand why write performance is different in both cases.</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22371#M15310</link>
      <description>&lt;P&gt;&lt;B&gt;Problem statement:&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Source file format: .tar.gz&lt;/LI&gt;&lt;LI&gt;Average size: 10 MB&lt;/LI&gt;&lt;LI&gt;Number of tar.gz files: 1000&lt;/LI&gt;&lt;LI&gt;Each tar.gz file contains around 20,000 CSV files.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Requirement:&lt;/B&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Untar each tar.gz file and write the CSV files to blob storage / an intermediate storage layer for further processing.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;What I have tried:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Untar and write to the mount location (attached screenshot):&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here I am using the unTar function of the Hadoop FileUtil library to untar and write the CSV files to the target storage (/dbfs/mnt/, blob storage).&lt;/P&gt;&lt;P&gt;It takes 1.50 hours to complete the job on a cluster with 2 worker nodes (4 cores each).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="databricks_write_to_dbfsMount"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1927iA829F34A9B34083C/image-size/large?v=v2&amp;amp;px=999" role="button" title="databricks_write_to_dbfsMount" alt="databricks_write_to_dbfsMount" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Untar and write to the DBFS root FileStore:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here I am using the unTar function of the Hadoop FileUtil library to untar and write the CSV files to the target storage (/dbfs/FileStore/). It takes just 8 minutes to complete the job on a cluster with 2 worker nodes (4 cores each).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="databricks_write_to_dbfsMount"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1931i45849B215487D1B5/image-size/large?v=v2&amp;amp;px=999" role="button" title="databricks_write_to_dbfsMount" alt="databricks_write_to_dbfsMount" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Questions:&lt;/B&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Why is writing to DBFS/FileStore or DBFS/databricks/driver 15 times faster than writing to DBFS/mnt storage?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What storage and file system does the DBFS root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 22 Apr 2022 08:38:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22371#M15310</guid>
      <dc:creator>Surendra</dc:creator>
      <dc:date>2022-04-22T08:38:20Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage). Same job is taking 8 minutes to write to /dbfs/FileStore.

I would like to understand why write performance is different in both cases.</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22372#M15311</link>
      <description>&lt;P&gt;@Surendranatha Reddy Chappidi, it seems that the problem is with the /dbfs/mnt mount, that is, the blob storage configuration:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Blob storage needs to be in the same availability zone as your Databricks workspace.&lt;/LI&gt;&lt;LI&gt;Please use a private link so that traffic is routed locally, not through the internet (in the network there is a private subnet used by Databricks, and there should be one more for remote endpoints).&lt;/LI&gt;&lt;LI&gt;Please upgrade blob storage to ADLS2.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here I explained how to add ADLS2 and a private link: &lt;A href="https://community.databricks.com/s/feed/0D53f00001eQGOHCA4" alt="https://community.databricks.com/s/feed/0D53f00001eQGOHCA4" target="_blank"&gt;https://community.databricks.com/s/feed/0D53f00001eQGOHCA4&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 22 Apr 2022 09:27:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22372#M15311</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-04-22T09:27:24Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage). Same job is taking 8 minutes to write to /dbfs/FileStore.

I would like to understand why write performance is different in both cases.</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22374#M15313</link>
      <description>&lt;P&gt;@Hubert Dudek&amp;nbsp; Thanks for your suggestions.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;After creating a storage account in the same region as Databricks, I can see that performance is as expected.&lt;/P&gt;&lt;P&gt;It is now clear that the issue was the /mnt/ location being in a different region than Databricks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I would like to understand why it takes 13x more time to write data to storage in a different region than to a storage account in the same region.&lt;/P&gt;&lt;P&gt;What API / protocol does Databricks use in the backend to write data to storage in the same region and in a different region?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am concerned because we are developing a service for customers.&lt;/P&gt;&lt;P&gt;Customers can choose the storage account region and the Databricks region when deploying this service in their subscription.&lt;/P&gt;&lt;P&gt;If the two regions differ, the customer will face the performance issue I reported earlier.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;@Kaniz Fatma&amp;nbsp; Kindly help me understand why it takes 13x more time to write data to storage in a different region than to a storage account in the same region.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Apr 2022 13:33:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22374#M15313</guid>
      <dc:creator>Surendra</dc:creator>
      <dc:date>2022-04-25T13:33:04Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage). Same job is taking 8 minutes to write to /dbfs/FileStore.

I would like to understand why write performance is different in both cases.</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22375#M15314</link>
      <description>&lt;P&gt;It is about routing. When traffic stays on the local network inside the region, it will be very fast. Even traffic that leaves the local network but stays in the same area is still really fast. However, it will be much slower when the data has to go to another region, especially when it travels over the public internet.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Something like 13x is what I would expect. Additionally, traffic routed outside the local network will generate outbound traffic charges.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Apr 2022 17:49:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-notebook-is-taking-2-hours-to-write-to-dbfs-mnt-blob/m-p/22375#M15314</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-04-25T17:49:29Z</dc:date>
    </item>
  </channel>
</rss>

