<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic speed up a for loop in python (azure databrick) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26154#M18267</link>
    <description>&lt;P&gt;code example&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;# a list of file path&lt;/P&gt;&lt;P&gt;list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]&lt;/P&gt;&lt;P&gt;# copy all file above to this folder&lt;/P&gt;&lt;P&gt;dest_path=""/dbfs/mnt/..."&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for file_path in list_files_path:&lt;/P&gt;&lt;P&gt;     # copy function&lt;/P&gt;&lt;P&gt;    copy_file(file_path, dest_path)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am running it in the azure databrick and it works fine. But I am wondering if I can utilize the power of parallel of cluster in the databrick.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I know that I can run the some kind of multi-threading in the master node but I am wondering if I can use pandas_udf to take advantage of work nodes as well.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks! &lt;/P&gt;</description>
    <pubDate>Tue, 08 Mar 2022 22:55:26 GMT</pubDate>
    <dc:creator>Jackie</dc:creator>
    <dc:date>2022-03-08T22:55:26Z</dc:date>
    <item>
      <title>speed up a for loop in python (azure databrick)</title>
      <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26154#M18267</link>
      <description>&lt;P&gt;code example&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;# a list of file path&lt;/P&gt;&lt;P&gt;list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]&lt;/P&gt;&lt;P&gt;# copy all file above to this folder&lt;/P&gt;&lt;P&gt;dest_path=""/dbfs/mnt/..."&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for file_path in list_files_path:&lt;/P&gt;&lt;P&gt;     # copy function&lt;/P&gt;&lt;P&gt;    copy_file(file_path, dest_path)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am running it in the azure databrick and it works fine. But I am wondering if I can utilize the power of parallel of cluster in the databrick.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I know that I can run the some kind of multi-threading in the master node but I am wondering if I can use pandas_udf to take advantage of work nodes as well.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks! &lt;/P&gt;</description>
      <pubDate>Tue, 08 Mar 2022 22:55:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26154#M18267</guid>
      <dc:creator>Jackie</dc:creator>
      <dc:date>2022-03-08T22:55:26Z</dc:date>
    </item>
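A minimal sketch of the driver-side multi-threading the question mentions: file copies are I/O-bound, so a thread pool overlaps them. `parallel_copy` is a hypothetical name, `shutil.copy` stands in for the thread's `copy_file`, and local temp directories stand in for the elided `/dbfs/mnt/...` paths, which stay unspecified in the original.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def parallel_copy(list_files_path, dest_path, max_workers=8):
    """Copy every file in list_files_path into dest_path concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # one copy task per file; result() re-raises any copy error
        futures = [pool.submit(shutil.copy, f, dest_path) for f in list_files_path]
        return [f.result() for f in futures]

# demo with throwaway temp files instead of DBFS mounts
src_dir = Path(tempfile.mkdtemp())
dest_dir = Path(tempfile.mkdtemp())
files = []
for i in range(4):
    p = src_dir / f"part_{i}.txt"
    p.write_text(f"data {i}")
    files.append(str(p))

parallel_copy(files, str(dest_dir))
print(sorted(p.name for p in dest_dir.iterdir()))
```

On a real cluster this only parallelizes on the driver; spreading the copy over worker nodes needs a Spark-level mechanism such as the COPY INTO approach suggested in the replies below.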
    <item>
      <title>Re: speed up a for loop in python (azure databrick)</title>
      <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26155#M18268</link>
      <description>&lt;P&gt;@Jackie Chan​&amp;nbsp;, To use spark parallelism you could register both destination as tables an use COPY INTO or register just source as table and use CREATE TABLE CLONE.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you want to use normal copy it is better to use dbutils.fs library&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you want to copy regularly data between ADSL/blobs nothing can catch up with Azure Data Factory. There you can make copy pipeline, it will be cheapest and fastest. If you need depedency to tun databricks notebook before/after copy you can orchestrate it there (on successful run databricks notebook etc.) as databricks is integrated with ADF.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Mar 2022 10:55:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26155#M18268</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-09T10:55:38Z</dc:date>
    </item>
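A hedged sketch of the COPY INTO route from the reply above: the snippet only builds the SQL statement (the table name and storage path are hypothetical placeholders, not from the thread); on a real cluster you would pass the result to `spark.sql(...)`, which distributes the load across the worker nodes.

```python
def build_copy_into(dest_table: str, source_path: str, file_format: str = "PARQUET") -> str:
    """Assemble a Databricks COPY INTO statement for loading files into a table."""
    return (
        f"COPY INTO {dest_table}\n"
        f"FROM '{source_path}'\n"
        f"FILEFORMAT = {file_format}"
    )

# hypothetical table and ADLS path for illustration only
sql = build_copy_into("my_dest_table", "abfss://container@account.dfs.core.windows.net/src/")
print(sql)
```

COPY INTO is also idempotent, so re-running the same statement skips files that were already loaded.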
    <item>
      <title>Re: speed up a for loop in python (azure databrick)</title>
      <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26156#M18269</link>
      <description>&lt;P&gt;@Jackie Chan​&amp;nbsp;, Indeed ADF has massive throughput. So go for ADF if you want a plain copy (so no transformations).&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 07:04:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26156#M18269</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-10T07:04:52Z</dc:date>
    </item>
    <item>
      <title>Re: speed up a for loop in python (azure databrick)</title>
      <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26158#M18271</link>
      <description>&lt;P&gt;@Jackie Chan​&amp;nbsp;, What's the data size you want to copy? If it's bigger, then use ADF.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2022 02:07:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26158#M18271</guid>
      <dc:creator>Hemant</dc:creator>
      <dc:date>2022-04-28T02:07:59Z</dc:date>
    </item>
  </channel>
</rss>

