<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks Python function achieving Parallelism in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98305#M39681</link>
    <description>&lt;P&gt;Any help here? Thanks.&lt;/P&gt;</description>
    <pubDate>Mon, 11 Nov 2024 05:06:55 GMT</pubDate>
    <dc:creator>sathya08</dc:creator>
    <dc:date>2024-11-11T05:06:55Z</dc:date>
    <item>
      <title>Databricks Python function achieving Parallelism</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98021#M39585</link>
      <description>&lt;P&gt;Hello everyone,&lt;/P&gt;&lt;P&gt;I have a very basic question about Databricks Spark parallelism.&lt;/P&gt;&lt;P&gt;I have a Python function inside a for loop, so I believe it is running sequentially.&lt;/P&gt;&lt;P&gt;The Databricks cluster is enabled with Photon and runs Spark 15.x. Does that mean the driver is responsible for making this run in parallel even though it is in a for loop, or do I need to introduce something to make the function run in parallel?&lt;/P&gt;&lt;P&gt;I need your help to understand the above, and if I need to introduce parallelism explicitly, how do I do it?&lt;/P&gt;&lt;P&gt;Also, how do I achieve it based on the total executor cores in the cluster? [I read that executor cores are responsible for parallelism.]&lt;/P&gt;&lt;P&gt;Please correct me if my understanding is wrong.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Sathya&lt;/P&gt;</description>
      <pubDate>Wed, 06 Nov 2024 22:40:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98021#M39585</guid>
      <dc:creator>sathya08</dc:creator>
      <dc:date>2024-11-06T22:40:24Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Python function achieving Parallelism</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98075#M39597</link>
      <description>&lt;P&gt;In Spark, the level of parallelism is determined by the number of partitions and the number of executor cores. Each task runs on a single core, so having more executor cores allows more tasks to run in parallel. Note that Photon accelerates query execution but does not turn a sequential Python for loop on the driver into parallel work.&lt;/P&gt;&lt;P&gt;To achieve parallelism, you need to introduce it explicitly. Spark provides several ways to do this, such as using map operations on RDDs or DataFrames, or using the concurrent.futures module in Python.&lt;/P&gt;&lt;P&gt;1) If your function can be applied to elements of an RDD or DataFrame, you can use Spark's map operation to run it in parallel across the cluster:&lt;/P&gt;&lt;PRE&gt;# Define your function
def my_function(x):
    return x * x

# Distribute example data across the cluster
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Apply the function in parallel
result_rdd = rdd.map(my_function)

# Collect the results back to the driver
results = result_rdd.collect()
print(results)
&lt;/PRE&gt;&lt;P&gt;2) If you need to run a Python function in parallel and it doesn't fit well with Spark's map operation, you can use the concurrent.futures module on the driver:&lt;/P&gt;&lt;PRE&gt;import concurrent.futures

# Define your function
def my_function(x):
    return x * x

# Example data
data = [1, 2, 3, 4, 5]

# ThreadPoolExecutor suits I/O-bound work; use ProcessPoolExecutor for CPU-bound work
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(my_function, data))

print(results)
&lt;/PRE&gt;</description>
      <pubDate>Thu, 07 Nov 2024 13:06:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98075#M39597</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2024-11-07T13:06:49Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Python function achieving Parallelism</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98147#M39623</link>
      <description>&lt;P&gt;Thank you for your reply. My code is not writing to any target; it is actually running OPTIMIZE and VACUUM on all the tables in a catalog. Currently the for loop takes one table at a time and performs the actions sequentially.&lt;/P&gt;&lt;P&gt;Can this be parallelized using the concurrent.futures module, or is there another way of doing it?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Sathya&lt;/P&gt;</description>
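      [Editor's note: the per-table OPTIMIZE/VACUUM loop described in this reply can indeed be parallelized with concurrent.futures. The sketch below is a hypothetical illustration, not Databricks-provided code; the names maintain_tables, run_sql, and max_workers are assumptions. On Databricks, run_sql would typically be spark.sql, and threads are sufficient because each call mostly waits on the Spark driver.]

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def maintain_tables(tables, run_sql, max_workers=4):
    """Run OPTIMIZE and VACUUM for each table concurrently.

    run_sql is the callable that executes a SQL statement; on
    Databricks this would be spark.sql (assumed wiring). Threads
    suffice here because each call is I/O-bound on the driver.
    """
    def maintain(table):
        run_sql(f"OPTIMIZE {table}")
        run_sql(f"VACUUM {table}")
        return table

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(maintain, t) for t in tables]
        # Collect table names as their maintenance jobs finish
        return [f.result() for f in as_completed(futures)]

# On Databricks (assumed usage): maintain_tables(table_list, spark.sql)
```

      Keep max_workers modest: each statement still consumes driver and cluster resources, so very wide fan-out gives diminishing returns.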
      <pubDate>Thu, 07 Nov 2024 22:20:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98147#M39623</guid>
      <dc:creator>sathya08</dc:creator>
      <dc:date>2024-11-07T22:20:48Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Python function achieving Parallelism</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98305#M39681</link>
      <description>&lt;P&gt;Any help here? Thanks.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2024 05:06:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-python-function-achieving-parallelism/m-p/98305#M39681</guid>
      <dc:creator>sathya08</dc:creator>
      <dc:date>2024-11-11T05:06:55Z</dc:date>
    </item>
  </channel>
</rss>

