<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance issue while loading bulk data into Postgres DB from Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8420#M4068</link>
    <description>&lt;P&gt;Hello @Janga Reddy​&amp;nbsp;@Daniel Sahal​&amp;nbsp;and @Vidula Khanna​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To improve performance we generally need to design for more parallelism; in the Spark JDBC context this is controlled by the number of partitions the data is written with.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The example &lt;A href="https://docs.databricks.com/external-data/jdbc.html#control-parallelism-for-jdbc-queries" alt="https://docs.databricks.com/external-data/jdbc.html#control-parallelism-for-jdbc-queries" target="_blank"&gt;here&lt;/A&gt; shows how to control parallelism when writing, which is driven by numPartitions during the read. While numPartitions is a Spark JDBC read option, the same effect can be achieved on a DataFrame using repartition (documentation &lt;A href="https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.repartition.html" alt="https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.repartition.html" target="_blank"&gt;here&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It is worth mentioning that parallel reads/writes can put pressure on the RDBMS (Postgres in this case): although the Spark write can happen in parallel, the sizing, capacity, and connection limits of the destination database should be taken into account and evaluated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
    <pubDate>Thu, 30 Mar 2023 02:30:59 GMT</pubDate>
    <dc:creator>User16502773013</dc:creator>
    <dc:date>2023-03-30T02:30:59Z</dc:date>
    <item>
      <title>Performance issue while loading bulk data into Postgres DB from Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8417#M4065</link>
      <description>&lt;P&gt;We are facing a performance issue while loading &lt;B&gt;bulk data into a Postgres DB from Databricks&lt;/B&gt;. We are using Spark JDBC connections to move the data. However, the transfer rate is very low, which is causing a performance bottleneck. Is there a better approach to achieve this task?&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2023 05:40:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8417#M4065</guid>
      <dc:creator>Phani1</dc:creator>
      <dc:date>2023-03-02T05:40:00Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue while loading bulk data into Postgres DB from Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8418#M4066</link>
      <description>&lt;P&gt;@Janga Reddy​&amp;nbsp;&lt;/P&gt;&lt;P&gt;I remember that we had this kind of question before. Switching to another library partially solved the issue.&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.databricks.com/s/question/0D58Y00009ia8JpSAI/getting-error-while-loading-parquet-data-into-postgres-using-sparkpostgres-library-classnotfoundexception-failed-to-find-data-source-postgres-please-find-packages-at-httpsparkapacheorgthirdpartyprojectshtml-caused-by-classnotfoundexception" target="_blank"&gt;https://community.databricks.com/s/question/0D58Y00009ia8JpSAI/getting-error-while-loading-parquet-data-into-postgres-using-sparkpostgres-library-classnotfoundexception-failed-to-find-data-source-postgres-please-find-packages-at-httpsparkapacheorgthirdpartyprojectshtml-caused-by-classnotfoundexception&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Mar 2023 06:41:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8418#M4066</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2023-03-03T06:41:55Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue while loading bulk data into Postgres DB from Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8419#M4067</link>
      <description>&lt;P&gt;Hi @Janga Reddy​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 21 Mar 2023 06:57:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8419#M4067</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-21T06:57:29Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue while loading bulk data into Postgres DB from Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8420#M4068</link>
      <description>&lt;P&gt;Hello @Janga Reddy​&amp;nbsp;@Daniel Sahal​&amp;nbsp;and @Vidula Khanna​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To improve performance we generally need to design for more parallelism; in the Spark JDBC context this is controlled by the number of partitions the data is written with.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The example &lt;A href="https://docs.databricks.com/external-data/jdbc.html#control-parallelism-for-jdbc-queries" alt="https://docs.databricks.com/external-data/jdbc.html#control-parallelism-for-jdbc-queries" target="_blank"&gt;here&lt;/A&gt; shows how to control parallelism when writing, which is driven by numPartitions during the read. While numPartitions is a Spark JDBC read option, the same effect can be achieved on a DataFrame using repartition (documentation &lt;A href="https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.repartition.html" alt="https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.repartition.html" target="_blank"&gt;here&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It is worth mentioning that parallel reads/writes can put pressure on the RDBMS (Postgres in this case): although the Spark write can happen in parallel, the sizing, capacity, and connection limits of the destination database should be taken into account and evaluated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2023 02:30:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8420#M4068</guid>
      <dc:creator>User16502773013</dc:creator>
      <dc:date>2023-03-30T02:30:59Z</dc:date>
    </item>
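    <!-- Editor's note: the answer above can be sketched in PySpark. This is a minimal sketch, not the author's exact code: `df`, `jdbc_url`, and the table/option values are assumptions to be replaced with your own. The helper caps the partition count so parallel writers do not exhaust the Postgres connection budget, echoing the caution about destination-database capacity. -->

```python
def capped_partitions(desired: int, max_db_connections: int) -> int:
    """Clamp the write-partition count to the database's connection budget.

    Each DataFrame partition opens its own JDBC connection during the write,
    so the partition count should never exceed what Postgres can serve.
    """
    return max(1, min(desired, max_db_connections))


# Hypothetical Databricks usage (assumes an existing SparkSession, a
# DataFrame `df`, and a `jdbc_url` like "jdbc:postgresql://host:5432/db"):
#
# n = capped_partitions(desired=64, max_db_connections=16)
# (df.repartition(n)                     # one JDBC connection per partition
#    .write
#    .format("jdbc")
#    .option("url", jdbc_url)
#    .option("dbtable", "public.target_table")
#    .option("user", "writer")
#    .option("password", password)
#    .option("batchsize", 10000)         # rows per batched INSERT
#    .mode("append")
#    .save())
```

    <!-- Increasing `batchsize` and the partition count usually raises throughput, but both multiply the load on Postgres, so tune them against the database's max_connections and I/O capacity. -->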
  </channel>
</rss>