<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance issue while loading bulk data into Postgres DB from Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8420#M4068</link>
    <description>&lt;P&gt;Hello @Janga Reddy​&amp;nbsp;@Daniel Sahal​&amp;nbsp;and @Vidula Khanna​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To improve performance we generally need to design for more parallelism; in the Spark JDBC context this is controlled by the number of partitions the data is written with.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The example &lt;A href="https://docs.databricks.com/external-data/jdbc.html#control-parallelism-for-jdbc-queries" alt="https://docs.databricks.com/external-data/jdbc.html#control-parallelism-for-jdbc-queries" target="_blank"&gt;here&lt;/A&gt; shows how to control parallelism when writing, which is driven by numPartitions during the read. While numPartitions is a Spark JDBC read option, the same effect can be achieved on a DataFrame using repartition (documentation &lt;A href="https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.repartition.html" alt="https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.repartition.html" target="_blank"&gt;here&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It is worth mentioning that parallel reads/writes can put pressure on the RDBMS (Postgres in this case): although the Spark write can happen in parallel, the sizing, capacity, and connection limits of the destination database should be taken into account and evaluated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
    <pubDate>Thu, 30 Mar 2023 02:30:59 GMT</pubDate>
    <dc:creator>User16502773013</dc:creator>
    <dc:date>2023-03-30T02:30:59Z</dc:date>
    <item>
      <title>Performance issue while loading bulk data into Postgres DB from Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8417#M4065</link>
      <description>&lt;P&gt;We are facing a performance issue while loading &lt;B&gt;bulk data into a Postgres DB from Databricks&lt;/B&gt;. We are using Spark JDBC connections to move the data. However, the transfer rate is very low, which is causing a performance bottleneck. Is there a better approach to achieve this task?&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2023 05:40:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8417#M4065</guid>
      <dc:creator>Phani1</dc:creator>
      <dc:date>2023-03-02T05:40:00Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue while loading bulk data into Postgres DB from Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8418#M4066</link>
      <description>&lt;P&gt;@Janga Reddy​&amp;nbsp;&lt;/P&gt;&lt;P&gt;I remember that we had this kind of question before. Switching to another library partially solved the issue.&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.databricks.com/s/question/0D58Y00009ia8JpSAI/getting-error-while-loading-parquet-data-into-postgres-using-sparkpostgres-library-classnotfoundexception-failed-to-find-data-source-postgres-please-find-packages-at-httpsparkapacheorgthirdpartyprojectshtml-caused-by-classnotfoundexception" target="_blank"&gt;https://community.databricks.com/s/question/0D58Y00009ia8JpSAI/getting-error-while-loading-parquet-data-into-postgres-using-sparkpostgres-library-classnotfoundexception-failed-to-find-data-source-postgres-please-find-packages-at-httpsparkapacheorgthirdpartyprojectshtml-caused-by-classnotfoundexception&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Mar 2023 06:41:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8418#M4066</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2023-03-03T06:41:55Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue while loading bulk data into Postgres DB from Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8419#M4067</link>
      <description>&lt;P&gt;Hi @Janga Reddy​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 21 Mar 2023 06:57:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8419#M4067</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-21T06:57:29Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue while loading bulk data into Postgres DB from Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8420#M4068</link>
      <description>&lt;P&gt;Hello @Janga Reddy​&amp;nbsp;@Daniel Sahal​&amp;nbsp;and @Vidula Khanna​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To improve performance we generally need to design for more parallelism; in the Spark JDBC context this is controlled by the number of partitions the data is written with.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The example &lt;A href="https://docs.databricks.com/external-data/jdbc.html#control-parallelism-for-jdbc-queries" alt="https://docs.databricks.com/external-data/jdbc.html#control-parallelism-for-jdbc-queries" target="_blank"&gt;here&lt;/A&gt; shows how to control parallelism when writing, which is driven by numPartitions during the read. While numPartitions is a Spark JDBC read option, the same effect can be achieved on a DataFrame using repartition (documentation &lt;A href="https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.repartition.html" alt="https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.repartition.html" target="_blank"&gt;here&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It is worth mentioning that parallel reads/writes can put pressure on the RDBMS (Postgres in this case): although the Spark write can happen in parallel, the sizing, capacity, and connection limits of the destination database should be taken into account and evaluated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2023 02:30:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-while-loading-bulk-data-into-post-gress-db/m-p/8420#M4068</guid>
      <dc:creator>User16502773013</dc:creator>
      <dc:date>2023-03-30T02:30:59Z</dc:date>
    </item>
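    <!-- Editor's note: the answer above can be sketched in PySpark. This is a minimal sketch, not the author's exact code: `df`, `jdbc_url`, and the table/option values are assumptions to be replaced with your own. The helper caps the partition count so parallel writers do not exhaust the Postgres connection budget, echoing the caution about destination-database capacity. -->

```python
def capped_partitions(desired: int, max_db_connections: int) -> int:
    """Clamp the write-partition count to the database's connection budget.

    Each DataFrame partition opens its own JDBC connection during the write,
    so the partition count should never exceed what Postgres can serve.
    """
    return max(1, min(desired, max_db_connections))


# Hypothetical Databricks usage (assumes an existing SparkSession, a
# DataFrame `df`, and a `jdbc_url` like "jdbc:postgresql://host:5432/db"):
#
# n = capped_partitions(desired=64, max_db_connections=16)
# (df.repartition(n)                     # one JDBC connection per partition
#    .write
#    .format("jdbc")
#    .option("url", jdbc_url)
#    .option("dbtable", "public.target_table")
#    .option("user", "writer")
#    .option("password", password)
#    .option("batchsize", 10000)         # rows per batched INSERT
#    .mode("append")
#    .save())
```

    <!-- Increasing `batchsize` and the partition count usually raises throughput, but both multiply the load on Postgres, so tune them against the database's max_connections and I/O capacity. -->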
  </channel>
</rss>