<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Is it possible to use multiprocessing or threads to submit multiple queries to a database from Databricks in parallel? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17403#M11428</link>
<description>&lt;P&gt;Spark, by default, works in a distributed way and processes data in parallel using native Spark.&lt;/P&gt;&lt;P&gt;It will use all cores; every core will process one partition and write it to the database.&lt;/P&gt;&lt;P&gt;The same applies to select.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So, for example, you can use the code below to read from SQL in parallel:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;table = (spark.read
  .format("jdbc")
  .option("url", "&amp;lt;jdbc_url&amp;gt;")
  .option("dbtable", "&amp;lt;table_name&amp;gt;")
  .option("user", "&amp;lt;username&amp;gt;")
  .option("password", "&amp;lt;password&amp;gt;")
  # a numeric column with a uniformly distributed range of values,
  # used to split the read into parallel queries
  .option("partitionColumn", "&amp;lt;partition_key&amp;gt;")
  # lowest partitionColumn value to pull data for
  .option("lowerBound", "&amp;lt;min_value&amp;gt;")
  # highest partitionColumn value to pull data for
  .option("upperBound", "&amp;lt;max_value&amp;gt;")
  # number of partitions to split the read into; sc.defaultParallelism matches the core count
  .option("numPartitions", sc.defaultParallelism)
  .load())&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Sat, 10 Dec 2022 21:30:53 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2022-12-10T21:30:53Z</dc:date>
    <item>
      <title>Is it possible to use multiprocessing or threads to submit multiple queries to a database from Databricks in parallel?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17402#M11427</link>
      <description>&lt;P&gt;We are trying to improve our overall runtime by running queries in parallel using either multiprocessing or threads. What I am seeing, though, is that when the function that runs this code is run on a separate process, it doesn't return a DataFrame with any info.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sample code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;def GetData(job_context, group_numbers):
    sqlquery = """
                SELECT *
                FROM
                    recon.v_ods_payment
                    WHERE group_number in({0})""".format(group_numbers)
       
    payments_df = job_context.create_dataframe_from_staging(sqlquery)
    print(payments_df)
    #display(payments_df)
    return payments_df


def DoWork():
    processes = []
    job_list = ["payments.payments_job"]

    for job in job_list:
        try:
            p = multiprocessing.Process(target=GetData, args=(job_context, "0030161",))
            p.start()
            processes.append(p)
        except Exception as ex:            
            status = "CL_Failed"
            message = f"The job {job} failed due to an exception. Please refer to Databricks for more information. Error: {repr(ex)}"
            job_context.logger_error(message)
            raise ex
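    # Note: a separate process cannot see the driver's SparkSession or its
    # py4j gateway, which is why the DataFrame fails in the child process
    # with "Method schema([]) does not exist". Threads do share the session,
    # so a minimal thread-based sketch (assuming job_context is thread-safe):
    #     from concurrent.futures import ThreadPoolExecutor
    #     with ThreadPoolExecutor(max_workers=4) as pool:
    #         futures = [pool.submit(GetData, job_context, g) for g in ["0030161"]]
    #         results = [f.result() for f in futures]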
       
    for p in processes:
        p.join()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The error I can see in the logs is as follows:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;  File "&amp;lt;command-2328844487947527&amp;gt;", line 13, in GetData
    print(payments_df)
  File "/databricks/spark/python/pyspark/sql/dataframe.py", line 452, in __repr__
    return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))
  File "/databricks/spark/python/pyspark/sql/dataframe.py", line 1040, in dtypes
    return [(str(f.name), f.dataType.simpleString()) for f in self.schema.fields]
  File "/databricks/spark/python/pyspark/sql/dataframe.py", line 257, in schema
    self._schema = _parse_datatype_json_string(self._jdf.schema().json())
  File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/databricks/spark/python/pyspark/sql/utils.py", line 127, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o1361.schema. Trace:
py4j.Py4JException: Method schema([]) does not exist&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This runs totally fine when not run on a separate process.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any ideas or help would be greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Sat, 10 Dec 2022 15:31:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17402#M11427</guid>
      <dc:creator>vanepet</dc:creator>
      <dc:date>2022-12-10T15:31:36Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use multiprocessing or threads to submit multiple queries to a database from Databricks in parallel?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17403#M11428</link>
      <description>&lt;P&gt;Spark, by default, works in a distributed way and processes data in parallel using native Spark.&lt;/P&gt;&lt;P&gt;It will use all cores; every core will process one partition and write it to the database.&lt;/P&gt;&lt;P&gt;The same applies to select.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So, for example, you can use the code below to read from SQL in parallel:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;table = (spark.read
  .format("jdbc")
  .option("url", "&amp;lt;jdbc_url&amp;gt;")
  .option("dbtable", "&amp;lt;table_name&amp;gt;")
  .option("user", "&amp;lt;username&amp;gt;")
  .option("password", "&amp;lt;password&amp;gt;")
  # a numeric column with a uniformly distributed range of values,
  # used to split the read into parallel queries
  .option("partitionColumn", "&amp;lt;partition_key&amp;gt;")
  # lowest partitionColumn value to pull data for
  .option("lowerBound", "&amp;lt;min_value&amp;gt;")
  # highest partitionColumn value to pull data for
  .option("upperBound", "&amp;lt;max_value&amp;gt;")
  # number of partitions to split the read into; sc.defaultParallelism matches the core count
  .option("numPartitions", sc.defaultParallelism)
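  # optionally, the standard Spark JDBC option "fetchsize" tunes how many
  # rows are fetched per round trip, e.g. .option("fetchsize", "1000")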
  .load())&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 10 Dec 2022 21:30:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17403#M11428</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-12-10T21:30:53Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use multiprocessing or threads to submit multiple queries to a database from Databricks in parallel?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17404#M11429</link>
      <description>&lt;P&gt;How would I achieve this if I wanted to run queries in parallel for different tables?&lt;/P&gt;&lt;P&gt;Like:&lt;/P&gt;&lt;P&gt;DF = SELECT * FROM table_a WHERE group = 1&lt;/P&gt;&lt;P&gt;DF2 = SELECT * FROM table_b WHERE group = 1&lt;/P&gt;</description>
      <pubDate>Sat, 10 Dec 2022 23:57:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17404#M11429</guid>
      <dc:creator>vanepet</dc:creator>
      <dc:date>2022-12-10T23:57:15Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use multiprocessing or threads to submit multiple queries to a database from Databricks in parallel?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17405#M11430</link>
      <description>&lt;P&gt;@Peter Vanelli​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Check if this helps:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/analytics-vidhya/horizontal-parallelism-with-pyspark-d05390aa1df5" target="_blank"&gt;https://medium.com/analytics-vidhya/horizontal-parallelism-with-pyspark-d05390aa1df5&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 11 Dec 2022 10:28:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17405#M11430</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2022-12-11T10:28:04Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use multiprocessing or threads to submit multiple queries to a database from Databricks in parallel?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17406#M11431</link>
      <description>&lt;P&gt;This might help: &lt;A href="https://dustinvannoy.com/2022/05/06/parallel-ingest-spark-notebook/" alt="https://dustinvannoy.com/2022/05/06/parallel-ingest-spark-notebook/" target="_blank"&gt;https://dustinvannoy.com/2022/05/06/parallel-ingest-spark-notebook/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Write a function that passes your table list, as an array, into a queue for threading.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Dec 2022 04:51:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/17406#M11431</guid>
      <dc:creator>huyd</dc:creator>
      <dc:date>2022-12-13T04:51:41Z</dc:date>
    </item>
    <item>
      <title>Re: Is it possible to use multiprocessing or threads to submit multiple queries to a database from D</title>
      <link>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/65845#M32937</link>
      <description>&lt;P&gt;Thanks for the links mentioned above. But both of them use raw Python to achieve parallelism. Does this mean Spark (read: PySpark) makes no provision for parallel execution of functions or even notebooks?&lt;/P&gt;&lt;P&gt;We used a wrapper notebook with ThreadPoolExecutor to get our job done. This one is purely Python. The wrapper notebook spawns the same child notebook with different parameters, and these child notebooks use Spark DataFrame SQL. It is running, and all I can say is that we are not exactly unhappy with it. It helped even more when we scaled our environment horizontally: four clusters, each loading around 200-250 files to Delta Lake. In total, around 24 GB of data within 2-3 hours, in incremental fashion.&lt;/P&gt;&lt;P&gt;Guess we have to live with that till Spark comes up with something in this area.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Apr 2024 04:37:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-it-possible-to-use-multiprocessing-or-threads-to-submit/m-p/65845#M32937</guid>
      <dc:creator>BapsDBS</dc:creator>
      <dc:date>2024-04-09T04:37:51Z</dc:date>
    </item>
  </channel>
</rss>

