<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Reading from one Postgres table results in several Scan JDBCRelation operations in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/reading-from-one-postgres-table-result-in-several-scan/m-p/68079#M33548</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am working on a Spark job where I'm reading several tables from PostgreSQL into DataFrames as follows:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = (spark.read
        .format("postgresql")
        .option("query", query)
        .option("host", database_host)
        .option("port", database_port)
        .option("database", database_name)
        .option("user", user)
        .option("password", password)
        .option("fetchsize", 500000)
        .load()
        )
df = df.cache()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unfortunately, I cannot partition the tables during the read because the columns I would like to partition by are VARCHARs. When I add .option("numPartitions", partitions) to the code above, the execution plan still indicates a single partition:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Scan JDBCRelation((SELECT party_name, party_id, company_id, deleted_at, updated_at FROM public.table) SPARK_GEN_SUBQ_3808) [numPartitions=1] (1)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm selecting between 1 and 10 columns from 6 tables. The largest table contains approximately 200,000,000 rows. The job involves several complex operations, including dozens of joins and multiple aggregations. It runs slowly, particularly during the reading phase.&lt;/P&gt;&lt;P&gt;In the execution plan, I noticed that each table involves multiple Scan JDBCRelation operations. For example, one of the PostgreSQL read queries, which has no WHERE clause, results in 28 Scan JDBCRelation operations.&lt;/P&gt;&lt;P&gt;Could anyone suggest potential optimizations for reading from PostgreSQL? Additionally, could you explain why there are multiple Scan JDBCRelation operations for a single source table?&lt;/P&gt;&lt;P&gt;Thank you for your assistance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please see the execution plan in the PDF attached to this post, or view it on Pastebin (paste password:&amp;nbsp;eZtui0teAL):&amp;nbsp;&lt;A href="https://pastebin.com/bviEiX73" target="_blank" rel="noopener"&gt;https://pastebin.com/bviEiX73&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Sat, 04 May 2024 02:18:45 GMT</pubDate>
    <dc:creator>lieber_augustin</dc:creator>
    <dc:date>2024-05-04T02:18:45Z</dc:date>
    <item>
      <title>Reading from one Postgres table results in several Scan JDBCRelation operations</title>
      <link>https://community.databricks.com/t5/data-engineering/reading-from-one-postgres-table-result-in-several-scan/m-p/68079#M33548</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am working on a Spark job where I'm reading several tables from PostgreSQL into DataFrames as follows:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = (spark.read
        .format("postgresql")
        .option("query", query)
        .option("host", database_host)
        .option("port", database_port)
        .option("database", database_name)
        .option("user", user)
        .option("password", password)
        .option("fetchsize", 500000)
        .load()
        )
df = df.cache()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unfortunately, I cannot partition the tables during the read because the columns I would like to partition by are VARCHARs. When I add .option("numPartitions", partitions) to the code above, the execution plan still indicates a single partition:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Scan JDBCRelation((SELECT party_name, party_id, company_id, deleted_at, updated_at FROM public.table) SPARK_GEN_SUBQ_3808) [numPartitions=1] (1)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm selecting between 1 and 10 columns from 6 tables. The largest table contains approximately 200,000,000 rows. The job involves several complex operations, including dozens of joins and multiple aggregations. It runs slowly, particularly during the reading phase.&lt;/P&gt;&lt;P&gt;In the execution plan, I noticed that each table involves multiple Scan JDBCRelation operations. For example, one of the PostgreSQL read queries, which has no WHERE clause, results in 28 Scan JDBCRelation operations.&lt;/P&gt;&lt;P&gt;Could anyone suggest potential optimizations for reading from PostgreSQL? Additionally, could you explain why there are multiple Scan JDBCRelation operations for a single source table?&lt;/P&gt;&lt;P&gt;Thank you for your assistance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please see the execution plan in the PDF attached to this post, or view it on Pastebin (paste password:&amp;nbsp;eZtui0teAL):&amp;nbsp;&lt;A href="https://pastebin.com/bviEiX73" target="_blank" rel="noopener"&gt;https://pastebin.com/bviEiX73&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 04 May 2024 02:18:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reading-from-one-postgres-table-result-in-several-scan/m-p/68079#M33548</guid>
      <dc:creator>lieber_augustin</dc:creator>
      <dc:date>2024-05-04T02:18:45Z</dc:date>
    </item>
  </channel>
</rss>

