<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic using concurrent.futures for parallelization in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/using-concurrent-futures-for-parallelization/m-p/110674#M43638</link>
    <description>&lt;P&gt;Hi, trying to copy a table with billions of rows from an enterprise data source into my databricks table.&amp;nbsp; To do this, I need to use a homegrown library which handles auth etc, runs the query and return a dataframe.&amp;nbsp; I am partitioning the table using ~100queries.&amp;nbsp; The code runs perfectly when I run sequentially (setting `NUM_PARALLEL=1` below).&amp;nbsp; However when even set NUM_PARALLEL=2, it runs fine (as in things are running in parallel) but for the other time, I keep getting an exception `&lt;SPAN&gt;SparkSession$ does not exist in the JVM&lt;/SPAN&gt;`.&amp;nbsp; There seem to be no errors thrown either on the enterprise data source side or the other library (for which I can see the logs too).&amp;nbsp; Thx!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def process_row(row):
  st = time.time()
  tmpdir = f"/dbfs{prm['TMP_DIR']}/{row.table_name}/{row.query_id}"
  for trial in range(1,prm["MAX_TRIES"]+1):
    try:
      os.makedirs(f"{tmpdir}/{trial}", exist_ok=True)
      with suppress_stdout_stderr():
        sdf = ud.getDataFrame(query=row.query, save_to_dir=f"{tmpdir}/{trial}").get_spark_df()
      
      if sdf:
        sdf.write.mode("append").saveAsTable(f"{prm['DATABASE']}.{row.table_name}")
        cnt_rec = sdf.count()
      else:
        cnt_rec = 0
      
      qry = f"UPDATE {prm['DATABASE']}._extract SET completed = '{datetime.now()}' WHERE query_id = '{row.query_id}'"
      spark.sql(qry)
      
      print(" {:,} ({:.1f}min) ".format(cnt_rec,(time.time()-st)/60.), end=" ", flush=True)
      break
    except Exception as e:
      print(e)
      print(f" (retrying {trial}) ", end=" ", flush=True)
      time.sleep(trial*random.randint(30,60))
  
  time.sleep(5)
  shutil.rmtree(tmpdir, ignore_errors=True)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;code which is calling the above&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;for tbl,rows_iter in itertools.groupby(queries_lst, key=lambda x: x.table_name):
  rows_lst = list(rows_iter)
  print(f"table: {tbl}  partitions: {len(rows_lst):,}", flush=True)
  
  st0 = time.time()
  if prm["NUM_PARALLEL"]&amp;gt;1:
    with ProcessPoolExecutor(max_workers=prm["NUM_PARALLEL"]) as executor:
      job_list = [executor.submit(process_row, row) for row in rows_lst]
      for job in as_completed(job_list):
        pass
  
  else:
    for row in rows_lst:
      process_row(row)
  
  print(f"\n  time: {(time.time()-st0)/60.:.1f}min\n", flush=True)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 20 Feb 2025 00:19:09 GMT</pubDate>
    <dc:creator>ironv</dc:creator>
    <dc:date>2025-02-20T00:19:09Z</dc:date>
    <item>
      <title>using concurrent.futures for parallelization</title>
      <link>https://community.databricks.com/t5/data-engineering/using-concurrent-futures-for-parallelization/m-p/110674#M43638</link>
      <description>&lt;P&gt;Hi, trying to copy a table with billions of rows from an enterprise data source into my databricks table.&amp;nbsp; To do this, I need to use a homegrown library which handles auth etc, runs the query and return a dataframe.&amp;nbsp; I am partitioning the table using ~100queries.&amp;nbsp; The code runs perfectly when I run sequentially (setting `NUM_PARALLEL=1` below).&amp;nbsp; However when even set NUM_PARALLEL=2, it runs fine (as in things are running in parallel) but for the other time, I keep getting an exception `&lt;SPAN&gt;SparkSession$ does not exist in the JVM&lt;/SPAN&gt;`.&amp;nbsp; There seem to be no errors thrown either on the enterprise data source side or the other library (for which I can see the logs too).&amp;nbsp; Thx!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def process_row(row):
  st = time.time()
  tmpdir = f"/dbfs{prm['TMP_DIR']}/{row.table_name}/{row.query_id}"
  for trial in range(1,prm["MAX_TRIES"]+1):
    try:
      os.makedirs(f"{tmpdir}/{trial}", exist_ok=True)
      with suppress_stdout_stderr():
        sdf = ud.getDataFrame(query=row.query, save_to_dir=f"{tmpdir}/{trial}").get_spark_df()
      
      if sdf:
        sdf.write.mode("append").saveAsTable(f"{prm['DATABASE']}.{row.table_name}")
        cnt_rec = sdf.count()
      else:
        cnt_rec = 0
      
      qry = f"UPDATE {prm['DATABASE']}._extract SET completed = '{datetime.now()}' WHERE query_id = '{row.query_id}'"
      spark.sql(qry)
      
      print(" {:,} ({:.1f}min) ".format(cnt_rec,(time.time()-st)/60.), end=" ", flush=True)
      break
    except Exception as e:
      print(e)
      print(f" (retrying {trial}) ", end=" ", flush=True)
      time.sleep(trial*random.randint(30,60))
  
  time.sleep(5)
  shutil.rmtree(tmpdir, ignore_errors=True)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;code which is calling the above&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;for tbl,rows_iter in itertools.groupby(queries_lst, key=lambda x: x.table_name):
  rows_lst = list(rows_iter)
  print(f"table: {tbl}  partitions: {len(rows_lst):,}", flush=True)
  
  st0 = time.time()
  if prm["NUM_PARALLEL"]&amp;gt;1:
    with ProcessPoolExecutor(max_workers=prm["NUM_PARALLEL"]) as executor:
      job_list = [executor.submit(process_row, row) for row in rows_lst]
      for job in as_completed(job_list):
        pass
  
  else:
    for row in rows_lst:
      process_row(row)
  
  print(f"\n  time: {(time.time()-st0)/60.:.1f}min\n", flush=True)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 20 Feb 2025 00:19:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/using-concurrent-futures-for-parallelization/m-p/110674#M43638</guid>
      <dc:creator>ironv</dc:creator>
      <dc:date>2025-02-20T00:19:09Z</dc:date>
    </item>
    <item>
      <title>Re: using concurrent.futures for parallelization</title>
      <link>https://community.databricks.com/t5/data-engineering/using-concurrent-futures-for-parallelization/m-p/137041#M50696</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The "SparkSession$ does not exist in the JVM" error in your scenario is&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;almost always due to the use of multiprocessing (like&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;ProcessPoolExecutor&lt;/CODE&gt;) with Spark&lt;/STRONG&gt;. Spark contexts and sessions cannot safely be shared across processes, especially in Databricks or PySpark environments, because the JVM and SparkContext are not fork-safe and cannot be serialized and sent to child processes reliably.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Why This Happens&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;SparkSession is JVM-bound&lt;/STRONG&gt;: Each process launched by&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;ProcessPoolExecutor&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;spins up its own Python interpreter, and tries to access the SparkSession that was created in the parent. This won't work; the session cannot be forked/copied into child JVMs.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;PySpark/Databricks does not support multiprocessing&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;using Python's&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;multiprocessing&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;or&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;ProcessPoolExecutor&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;for Spark jobs. Each process must independently create its own SparkSession, which can cause resource contention, failures, and the SparkSession$ error you're seeing.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Sequential operation works&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;because only one Python process and the main Spark JVM are used; no parallelism implies no cross-process sharing issues.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;How to Fix&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Use thread-based, not process-based parallelism:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Replace&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;ProcessPoolExecutor&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;ThreadPoolExecutor&lt;/CODE&gt;. Spark is not forkable but can tolerate concurrent threads within the same driver process — provided thread safety is handled.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Alternatively, use Spark's own parallelism through DataFrame partitioning or&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;mapPartitions&lt;/CODE&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Example Fix&lt;/H2&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;from&lt;/SPAN&gt; concurrent&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;futures &lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; ThreadPoolExecutor&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; as_completed
&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;
&lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; prm&lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"NUM_PARALLEL"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt; &lt;SPAN class="token token operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;1&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    &lt;SPAN class="token token"&gt;with&lt;/SPAN&gt; ThreadPoolExecutor&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;max_workers&lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt;prm&lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"NUM_PARALLEL"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;as&lt;/SPAN&gt; executor&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        job_list &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;executor&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;submit&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;process_row&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; row&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; row &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; rows_lst&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; job &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; as_completed&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;job_list&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
            &lt;SPAN class="token token"&gt;pass&lt;/SPAN&gt;
&lt;SPAN class="token token"&gt;else&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    &lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; row &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; rows_lst&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        process_row&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;row&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;This should eliminate the "SparkSession$" error because all threads share the same process, JVM, and Spark context.&lt;/P&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Important Caveats:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Make sure your homegrown library and all Spark interactions are thread-safe.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Limit the level of parallelism (&lt;CODE&gt;NUM_PARALLEL&lt;/CODE&gt;) to not overwhelm the Spark driver.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Consider possible GIL limitations for non-Spark workloads.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Alternatives&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Leverage Spark native partitioning&lt;/STRONG&gt;: Run a single distributed Spark job that pulls data in partitions.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Use Databricks workflows&lt;/STRONG&gt;: For large-scale orchestrations, Databricks Jobs can run tasks in parallel safely.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;References&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;: Learn why SparkSession errors happen with multiprocessing and how to fix them&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;: Official Databricks documentation on SparkSession lifecycle and forks&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Switching to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;ThreadPoolExecutor&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;will let your parallelism work without causing SparkSession errors. For industrial-scale data loads, native Spark parallelism is preferable.&lt;/P&gt;</description>
      <pubDate>Fri, 31 Oct 2025 15:24:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/using-concurrent-futures-for-parallelization/m-p/137041#M50696</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-10-31T15:24:51Z</dc:date>
    </item>
  </channel>
</rss>

