<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: parallelizing function call in databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58846#M31278</link>
    <description>&lt;P&gt;AFAIK a thread pool works on a single machine, so by using it you cannot scale out to multiple nodes.&lt;BR /&gt;Are the tables you are talking about Spark tables, or do they come from a database?&lt;/P&gt;</description>
    <pubDate>Wed, 31 Jan 2024 14:19:44 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2024-01-31T14:19:44Z</dc:date>
    <item>
      <title>parallelizing function call in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/57690#M30865</link>
      <description>&lt;P&gt;I have a use case where I have to process streaming data and create categorical tables (about 500 tables). I'm using concurrent thread pools to parallelize the whole process, but in the Spark UI I can see that my code doesn't utilize all the workers (cluster configuration: Standard_e8ads type for both driver and workers; 4 workers with 32 GB memory and 4 cores each). I'm using 4 threads.&lt;/P&gt;&lt;P&gt;The code sometimes executes on the driver and sometimes on a worker, and I never get more than 40 to 45% utilization for 5 million records.&lt;/P&gt;&lt;P&gt;The function I call through the thread pool consists entirely of Spark code.&lt;/P&gt;&lt;P&gt;Any help on this issue would be highly appreciated. Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Jan 2024 09:32:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/57690#M30865</guid>
      <dc:creator>Shivanshu_</dc:creator>
      <dc:date>2024-01-18T09:32:11Z</dc:date>
    </item>
    <item>
      <title>Re: parallelizing function call in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58846#M31278</link>
      <description>&lt;P&gt;AFAIK a thread pool works on a single machine, so by using it you cannot scale out to multiple nodes.&lt;BR /&gt;Are the tables you are talking about Spark tables, or do they come from a database?&lt;/P&gt;</description>
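To illustrate the single-machine point in the reply above, here is a minimal sketch in plain Python with no Spark involved. The function names where_am_i and run_tasks are made up for this example. It only shows that every task handed to a ThreadPoolExecutor runs inside the same process; whether a cluster is used at all depends on the Spark jobs those threads submit, not on the pool itself.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def where_am_i(task_id):
    # Each task reports the process it ran in and the thread that ran it.
    return task_id, os.getpid(), threading.current_thread().name


def run_tasks(num_tasks=8, num_threads=4):
    # 4 threads mirrors the setup described in the question (an assumption
    # for this sketch, not a recommendation).
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(where_am_i, range(num_tasks)))


if __name__ == "__main__":
    results = run_tasks()
    pids = {pid for _, pid, _ in results}
    threads = {name for _, _, name in results}
    print("distinct process ids:", len(pids))
    print("threads used:", sorted(threads))
```

All tasks share one process id, so the pool by itself never leaves the machine it was started on.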
      <pubDate>Wed, 31 Jan 2024 14:19:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58846#M31278</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-01-31T14:19:44Z</dc:date>
    </item>
    <item>
      <title>Re: parallelizing function call in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58847#M31279</link>
      <description>&lt;P&gt;Spark tables&lt;/P&gt;</description>
      <pubDate>Wed, 31 Jan 2024 14:25:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58847#M31279</guid>
      <dc:creator>Shivanshu_</dc:creator>
      <dc:date>2024-01-31T14:25:11Z</dc:date>
    </item>
    <item>
      <title>Re: parallelizing function call in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58848#M31280</link>
      <description>&lt;P&gt;Why not create a single table with 500 partitions?&lt;BR /&gt;If that is not an option, you could still write the data as partitioned Parquet files and then create a table from each partition using a small Python script.&lt;/P&gt;</description>
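A rough sketch of what the "small Python script" from the reply above could look like, under stated assumptions: the data was already written once with something like df.write.partitionBy("category").parquet(base_path), the base path /mnt/demo/events and the column name category are invented for illustration, and the helpers table_name_for and create_table_statements are hypothetical. The script only builds the SQL strings; on a cluster each one would be passed to spark.sql.

```python
import re


def table_name_for(category, prefix="cat_"):
    # Replace anything that is not a letter, digit, or underscore so the
    # result can be used as an unquoted SQL identifier.
    cleaned = re.sub(r"[^0-9a-zA-Z_]", "_", str(category)).lower()
    return prefix + cleaned


def create_table_statements(base_path, partition_column, categories):
    # One external table per partition directory, e.g. .../category=books.
    # Assumes simple partition values; Spark escapes special characters in
    # directory names, which this sketch does not handle.
    statements = []
    for category in categories:
        location = f"{base_path.rstrip('/')}/{partition_column}={category}"
        name = table_name_for(category)
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {name} USING PARQUET LOCATION '{location}'"
        )
    return statements


if __name__ == "__main__":
    demo = create_table_statements(
        "/mnt/demo/events", "category", ["books", "home-garden"]
    )
    for stmt in demo:
        print(stmt)  # on a cluster this would be spark.sql(stmt)
```

A table that points at a single partition directory will not contain the partition column itself, because that value lives only in the directory name.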
      <pubDate>Wed, 31 Jan 2024 14:39:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58848#M31280</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-01-31T14:39:44Z</dc:date>
    </item>
    <item>
      <title>Re: parallelizing function call in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58882#M31287</link>
      <description>&lt;P&gt;You can use DLT (Delta Live Tables) and read from a many-to-one table.&lt;/P&gt;</description>
      <pubDate>Wed, 31 Jan 2024 19:03:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallelizing-function-call-in-databricks/m-p/58882#M31287</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2024-01-31T19:03:35Z</dc:date>
    </item>
  </channel>
</rss>

