Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading from one Postgres table results in several Scan JDBCRelation operations

lieber_augustin
New Contributor

Hello,

I am working on a Spark job where I'm reading several tables from PostgreSQL into DataFrames as follows:

 

df = (spark.read
        .format("postgresql")           # Databricks-native PostgreSQL connector
        .option("query", query)
        .option("host", database_host)
        .option("port", database_port)
        .option("database", database_name)
        .option("user", user)
        .option("password", password)
        .option("fetchsize", 500000)    # rows fetched per JDBC round trip
        .load()
        )
df = df.cache()

 

Unfortunately, I cannot partition the tables during the read because the columns I would like to partition by are VARCHARs. When I add .option("numPartitions", partitions) to the code above, the execution plan still indicates a single partition:

 

Scan JDBCRelation((SELECT party_name, party_id, company_id, deleted_at, updated_at FROM public.table) SPARK_GEN_SUBQ_3808) [numPartitions=1] (1)
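For reference, my understanding is that the generic jdbc source honours numPartitions only when partitionColumn, lowerBound, and upperBound are also set, and the partition column must be numeric, date, or timestamp, which rules out my VARCHARs. A simplified sketch (not Spark's actual code) of how those options would be turned into per-partition predicates:

```python
# Simplified mirror of how Spark's JDBC source splits the range
# [lowerBound, upperBound) into numPartitions stride predicates.
# Each predicate becomes one JDBC query, i.e. one Spark task.
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    predicates = []
    bound = lower
    for i in range(num_partitions):
        lo = f"{column} >= {bound}" if i > 0 else None
        bound += stride
        hi = f"{column} < {bound}" if i < num_partitions - 1 else None
        predicates.append(f"{lo} AND {hi}" if lo and hi else (lo or hi))
    return predicates

for p in jdbc_partition_predicates("company_id", 0, 100, 4):
    print(p)
# company_id < 25
# company_id >= 25 AND company_id < 50
# company_id >= 50 AND company_id < 75
# company_id >= 75
```

(The column name "company_id" is just a placeholder here; in my case none of the candidate columns are numeric.)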

 

I'm selecting between 1 and 10 columns from 6 tables. The largest table contains approximately 200,000,000 rows. The job involves several complex operations, including dozens of joins and multiple aggregations, and it runs slowly, particularly during the reading phase.

In the execution plan, I noticed that each table involves multiple Scan JDBCRelation operations. For example, one of the PostgreSQL read queries, which has no WHERE clause, results in 28 Scan JDBC operations.

Could anyone suggest potential optimizations for reading from PostgreSQL? Additionally, could you explain why there are multiple Scan JDBCs for a single source table?

Thank you for your assistance!

 

Please see the execution plan in the PDF attached to this post

or check it on pastebin (paste password: eZtui0teAL) - https://pastebin.com/bviEiX73

1 REPLY

Kaniz_Fatma
Community Manager

Hi @lieber_augustin, optimizing the performance of your PostgreSQL reads involves several considerations.

Let’s address both the potential optimizations and the reason behind multiple Scan JDBCRelation operations.

  1. Database Design:

    • Properly designing your database schema is crucial for performance. Consider partitioning your data into logically separated tables instead of having one large table. This can significantly improve query performance.
    • Ensure that you have appropriate indexes. Indexes help filter data efficiently, especially for frequently executed queries. Create indexes on columns commonly used as filters in your queries.
  2. Query Optimization:

  3. Batch Processing:

  4. Parallelism and Partitioning:

    • Since you mentioned that you cannot partition the tables during read due to VARCHAR columns, consider other ways to parallelize the process.
    • You can manually split the data into smaller chunks (based on some criteria) and read them in parallel. For example, if your VARCHAR columns have a natural grouping, you can read subsets of data concurrently.
    • Additionally, explore Spark’s parallelism settings. You can adjust the number of partitions based on available resources and the size of your data. However, be cautious not to create too many partitions, as it can lead to overhead.
  5. JDBC Batch Execution:

    • When reading data from PostgreSQL, Spark fetches rows over JDBC in chunks of fetchsize, but all of those chunks belong to a single Scan JDBCRelation operator; fetch batching alone does not multiply scan operators in the plan.
    • Multiple Scan JDBCRelation operations for one source table typically mean the plan references that relation several times: DataFrames are evaluated lazily, so a DataFrame that feeds many joins and aggregations is re-read from the database for each reference unless it has been cached or persisted.
    • Note that df.cache() is itself lazy. Materialize the cache with an action (for example df.count()) before running the heavy query, and check the Storage tab of the Spark UI, since caches evicted under memory pressure fall back to JDBC scans.
    • So 28 Scan JDBC operations for one query most likely reflect 28 references to that relation in the plan rather than 28 fetch batches; the exact count depends on how often the DataFrame is reused and on whether the cache is actually in effect.
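One thing worth checking in the attached plan: a lazily evaluated DataFrame is re-read from JDBC once per reference unless it is cached and materialized, which can multiply Scan JDBCRelation operators. A plain-Python analogue of that effect (not Spark code, just an illustration of lazy re-reads versus a materialized cache):

```python
# Count how many times the "source" is scanned.
reads = {"count": 0}

def scan_jdbc():
    # Stands in for spark.read...load() hitting PostgreSQL.
    reads["count"] += 1
    return [1, 2, 3]

# No cache: every consumer re-runs the scan, like two joins
# each re-reading the same JDBC relation.
total = sum(scan_jdbc())
top = max(scan_jdbc())
uncached_scans = reads["count"]   # 2 scans for 2 uses

# "Cached": materialize once (like df.cache() followed by
# df.count()), then reuse the in-memory copy.
reads["count"] = 0
cached = scan_jdbc()
total, top = sum(cached), max(cached)
cached_scans = reads["count"]     # 1 scan, however many uses
```

In Spark, once the cache is materialized, subsequent plan fragments show InMemoryTableScan instead of Scan JDBCRelation.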

In summary, focus on database design, query optimization, and parallelism to improve your PostgreSQL read performance. Experiment with different settings and monitor the impact on execution time. Tuning PostgreSQL reads in Spark is workload-specific, so tailor these suggestions to your data and cluster.
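As a concrete starting point for the manual-split idea above (point 4): the predicates argument of DataFrameReader.jdbc takes one WHERE fragment per partition, so even a VARCHAR column can drive parallel reads by hashing it on the PostgreSQL side. This is a sketch, untested against your schema; the connection names (database_host, user, ...) are taken from your original snippet, and hashtext is a PostgreSQL built-in hash function:

```python
# Build one WHERE fragment per desired partition. Hashing the
# VARCHAR column on the Postgres side spreads rows roughly evenly,
# so each fragment selects a disjoint slice of the table.
num_parts = 8
predicates = [
    f"abs(hashtext(party_name)) % {num_parts} = {i}"
    for i in range(num_parts)
]

# Hypothetical usage with the generic JDBC reader:
# df = spark.read.jdbc(
#     url=f"jdbc:postgresql://{database_host}:{database_port}/{database_name}",
#     table="(SELECT party_name, party_id, company_id FROM public.table) t",
#     predicates=predicates,
#     properties={"user": user, "password": password},
# )
```

Each predicate becomes its own task, so the read runs with num_parts-way parallelism even without a numeric partitionColumn; just make sure the predicates are disjoint and cover all rows, or you will drop or duplicate data.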

Feel free to explore the suggested optimizations and adapt them to your Spark job. If you have any further questions or need additional assistance, feel free to ask! 😊

 