<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Load tables from JDBC in parallel in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/83530#M36952</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/115023"&gt;@ADB0513&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can use a dictionary instead of a tuple and pass your values as multiple key-value pairs.&lt;/P&gt;</description>
    <pubDate>Tue, 20 Aug 2024 05:37:53 GMT</pubDate>
    <dc:creator>Ajay-Pandey</dc:creator>
    <dc:date>2024-08-20T05:37:53Z</dc:date>
    <item>
      <title>Load tables from JDBC in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/82885#M36764</link>
      <description>&lt;P&gt;I have a list of about 80 tables that I need to load from an Oracle database into Databricks via JDBC.&amp;nbsp; I would like to do this in parallel, instead of looping through one table at a time.&lt;/P&gt;&lt;P&gt;I have a function defined to ingest the data:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def ingest_data(database, table, primary_key):&lt;/LI-CODE&gt;&lt;P&gt;There is some logic in my function, but the main part of it is loading the dataframe from JDBC and writing to a table in Databricks:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("partitionColumn", primary_key) \
    .option("lowerBound", int(lowerBound)) \
    .option("upperBound", int(upperBound)) \
    .option("numPartitions", 8) \
    .load()
df.write.mode("overwrite").saveAsTable(table_name)&lt;/LI-CODE&gt;&lt;P&gt;What is the best way to call this function multiple times in parallel?&amp;nbsp; I have looked at ThreadPoolExecutor, but have never used it.&amp;nbsp; Also, I'm not sure how I would call it given that my function has more than one parameter.&lt;/P&gt;&lt;P&gt;Any help is appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Aug 2024 15:20:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/82885#M36764</guid>
      <dc:creator>ADB0513</dc:creator>
      <dc:date>2024-08-13T15:20:35Z</dc:date>
    </item>
    <item>
      <title>Re: Load tables from JDBC in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/82949#M36789</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/115023"&gt;@ADB0513&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Yes, you can use &lt;SPAN&gt;ThreadPoolExecutor. Databricks is also planning a for_each activity that you can use to call the same notebook multiple times with different parameters.&lt;BR /&gt;&lt;BR /&gt;Please find the attached code for your reference:&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from concurrent.futures import ThreadPoolExecutor

class NotebookData:

    def __init__(self, path, timeout, parameters=None, retry=0):
        self.path = path
        self.timeout = timeout
        self.parameters = parameters
        self.retry = retry

    @staticmethod
    def submit_notebook(notebook):
        # print("Running notebook for table: %s" % notebook.parameters['tableName'])
        try:
            if notebook.parameters:
                return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
            else:
                return dbutils.notebook.run(notebook.path, notebook.timeout)
        except Exception:
            if notebook.retry &amp;lt; 1:
                print("Failed for notebook:", notebook.parameters)
                raise
            # print("Retrying notebook for table: %s" % notebook.parameters['tableName'])
            notebook.retry = notebook.retry - 1
            return NotebookData.submit_notebook(notebook)

def parallel_notebooks(notebooks, parallel_thread):
    """
    If you submit too many notebooks at once, the driver may crash.
    This limits how many notebooks run in parallel.
    """
    with ThreadPoolExecutor(max_workers=parallel_thread) as ec:
        return [ec.submit(NotebookData.submit_notebook, notebook) for notebook in notebooks]

# Build the array of NotebookData instances.
# notebook_path and var_dict are assumed to be defined earlier in the notebook.
notebooks = [NotebookData(notebook_path, 3600, {"param1": f'{param1}'}) for param1 in var_dict.values()]

parallel_thread = 60

try:
    res = parallel_notebooks(notebooks, parallel_thread)
    result = [i.result(timeout=3600) for i in res]  # This is a blocking call.
    print(result)
except Exception as e:
    print(e)&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 14 Aug 2024 10:27:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/82949#M36789</guid>
      <dc:creator>Ajay-Pandey</dc:creator>
      <dc:date>2024-08-14T10:27:21Z</dc:date>
    </item>
    <item>
      <title>Re: Load tables from JDBC in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/82964#M36793</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/115023"&gt;@ADB0513&lt;/a&gt;, Thanks for reaching out! Please review the response and let us know if it answers your question. Your feedback is valuable to us and the community.&lt;/P&gt;
&lt;P&gt;If the response resolves your issue, kindly mark it as the accepted solution. This will help close the thread and assist others with similar queries.&lt;/P&gt;
&lt;P&gt;We appreciate your participation and are here if you need further assistance!&lt;/P&gt;</description>
      <pubDate>Wed, 14 Aug 2024 11:48:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/82964#M36793</guid>
      <dc:creator>Retired_mod</dc:creator>
      <dc:date>2024-08-14T11:48:01Z</dc:date>
    </item>
    <item>
      <title>Re: Load tables from JDBC in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/83241#M36887</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/71565"&gt;@Ajay-Pandey&lt;/a&gt;&amp;nbsp; &amp;nbsp;Thank you for this.&amp;nbsp; Here are some code snippets of what I am working on.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;table_list = [{'table': 'table1'}, {'table': 'table2'}, {'table': 'table3'}]&lt;/LI-CODE&gt;&lt;LI-CODE lang="python"&gt;import concurrent.futures

def run_notebook(table):
    dbutils.notebook.run("./Notebook", 0, table)
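
# Editor's sketch (an assumption, not from the original post): dbutils.notebook.run
# only accepts string parameter values, so one way to pass several attributes per
# table is to JSON-encode each dict into a single widget value and decode it inside
# the child notebook. The "payload" key below is hypothetical.
import json

example = {"table": "table1", "attribute1": "value1", "attribute2": "value2"}
payload = json.dumps(example)
# Parent notebook: dbutils.notebook.run("./Notebook", 0, {"payload": payload})
# Child notebook:  params = json.loads(dbutils.widgets.get("payload"))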

max_workers = 5

with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    executor.map(run_notebook, table_list)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;In my notebook that is being called, I have a widget to accept the table value.&amp;nbsp; How can I do this if I want to pass in a tuple to my notebook?&amp;nbsp; Something like this?&lt;/P&gt;&lt;LI-CODE lang="python"&gt;table_list = [{'table': 'table1', 'attribute1': 'value1', 'attribute2': 'value2'}, {'table': 'table2', 'attribute1': 'value8', 'attribute2': 'value19'}, {'table': 'table3', 'attribute1': 'value17', 'attribute2': 'value475'}]&lt;/LI-CODE&gt;&lt;P&gt;How can I have the widget accept a tuple, or how can I convert the string value of the widget to a tuple?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Aug 2024 15:16:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/83241#M36887</guid>
      <dc:creator>ADB0513</dc:creator>
      <dc:date>2024-08-16T15:16:04Z</dc:date>
    </item>
    <item>
      <title>Re: Load tables from JDBC in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/83530#M36952</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/115023"&gt;@ADB0513&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can use a dictionary instead of a tuple and pass your values as multiple key-value pairs.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2024 05:37:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/83530#M36952</guid>
      <dc:creator>Ajay-Pandey</dc:creator>
      <dc:date>2024-08-20T05:37:53Z</dc:date>
    </item>
    <item>
      <title>Re: Load tables from JDBC in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/83702#M36990</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/115023"&gt;@ADB0513&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Databricks now supports for_each, which means you don't have to use multithreading; you can use for_each directly.&amp;nbsp;&lt;BR /&gt;&lt;A href="https://community.databricks.com/t5/knowledge-sharing-hub/boost-your-data-pipelines-with-dynamic-data-driven-databricks/td-p/83701" target="_self"&gt;for_each activity&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 21 Aug 2024 04:20:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-tables-from-jdbc-in-parallel/m-p/83702#M36990</guid>
      <dc:creator>Ajay-Pandey</dc:creator>
      <dc:date>2024-08-21T04:20:25Z</dc:date>
    </item>
  </channel>
</rss>

