<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Exploring parallelism for multiple tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117073#M45420</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/150633"&gt;@suja&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Use Databricks Workflows (Jobs) with task parallelism. Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for one table. Tasks can:&lt;BR /&gt;1. Run in parallel&lt;BR /&gt;2. Be modular and reusable&lt;BR /&gt;3. Be monitored and retried independently&lt;BR /&gt;Each task (or task group) would handle the processing of one Hive table from Bronze → Silver → Gold.&lt;/P&gt;&lt;P&gt;Avoid using threads for Spark workloads. Python threads are not recommended here because:&lt;BR /&gt;Spark is already distributed, so driver-side threads add little.&lt;BR /&gt;Threads don’t provide true parallelism in Python (due to the GIL).&lt;BR /&gt;You lose visibility, fault tolerance, and scalability.&lt;/P&gt;&lt;P&gt;In short: use Databricks Workflows with parallel tasks, each processing one Hive table through Bronze → Silver → Gold and writing to the relational DB. Avoid threading; instead, modularize the processing via parameterized notebooks or scripts. Spark jobs scale better via job tasks than via threads.&lt;/P&gt;</description>
    <pubDate>Wed, 30 Apr 2025 03:43:57 GMT</pubDate>
    <dc:creator>lingareddy_Alva</dc:creator>
    <dc:date>2025-04-30T03:43:57Z</dc:date>
    <item>
      <title>Exploring parallelism for multiple tables</title>
      <link>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117068#M45419</link>
      <description>&lt;P&gt;I am new to Databricks. The app we need to build reads from Hive tables, goes through Bronze, Silver, and Gold layers, and stores the results in relational DB tables. There are multiple Hive tables with no dependencies between them. What is the best way to achieve parallelism? Should we use threads through each layer to process the multiple tables, run them as separate tasks in jobs, or is there another approach? What would be the most efficient implementation? Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 02:48:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117068#M45419</guid>
      <dc:creator>suja</dc:creator>
      <dc:date>2025-04-30T02:48:49Z</dc:date>
    </item>
    <item>
      <title>Re: Exploring parallelism for multiple tables</title>
      <link>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117073#M45420</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/150633"&gt;@suja&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Use Databricks Workflows (Jobs) with task parallelism. Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for one table. Tasks can:&lt;BR /&gt;1. Run in parallel&lt;BR /&gt;2. Be modular and reusable&lt;BR /&gt;3. Be monitored and retried independently&lt;BR /&gt;Each task (or task group) would handle the processing of one Hive table from Bronze → Silver → Gold.&lt;/P&gt;&lt;P&gt;Avoid using threads for Spark workloads. Python threads are not recommended here because:&lt;BR /&gt;Spark is already distributed, so driver-side threads add little.&lt;BR /&gt;Threads don’t provide true parallelism in Python (due to the GIL).&lt;BR /&gt;You lose visibility, fault tolerance, and scalability.&lt;/P&gt;&lt;P&gt;In short: use Databricks Workflows with parallel tasks, each processing one Hive table through Bronze → Silver → Gold and writing to the relational DB. Avoid threading; instead, modularize the processing via parameterized notebooks or scripts. Spark jobs scale better via job tasks than via threads.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 03:43:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117073#M45420</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-04-30T03:43:57Z</dc:date>
    </item>
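The reply's recommendation (one parallel Jobs task per Hive table, each running a parameterized notebook) can be sketched as a Jobs definition. This is a minimal illustration, not a complete pipeline: the notebook path, job name, task keys, and the table_name parameter are all assumed names for the sake of the example.

```yaml
# Sketch of a Databricks job (Jobs API 2.1 fields, shown as YAML) with
# independent tasks, one per Hive table. Paths and names are illustrative.
name: hive_tables_medallion
max_concurrent_runs: 1
tasks:
  - task_key: process_orders
    notebook_task:
      notebook_path: /Workspace/etl/bronze_silver_gold   # hypothetical shared notebook
      base_parameters:
        table_name: orders
  - task_key: process_customers
    notebook_task:
      notebook_path: /Workspace/etl/bronze_silver_gold
      base_parameters:
        table_name: customers
# Tasks with no depends_on entries have no ordering constraints, so the
# scheduler starts them together: each table's Bronze → Silver → Gold
# processing runs as its own parallel, independently retryable task.
```

Inside the shared notebook, the table to process can be read with `dbutils.widgets.get("table_name")`, which is how notebook task parameters are exposed on Databricks.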
  </channel>
</rss>

