Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Optimized option to write updates to Aurora PostgreSQL from Databricks/Spark

Sweta
New Contributor II

Hello All,

We want to update our Postgres tables from our Spark Structured Streaming workflow on Databricks, using foreachBatch to write to this sink. I want to understand an optimized way to do this at near-real-time latency while avoiding deadlocks and improving concurrency and parallelism. Right now we are considering two options:

1. JDBC connector/psycopg2: we run updates directly against the main table in Postgres, but we are not using parallelism appropriately and insert one record at a time (batchsize=1) for fear of losing or delaying the other records in the batch if a single record fails. This also increases latency, which we do not want. (A rough sketch of this path is below the list.)

2. Append to temp table: here we create a staging/temp table in Postgres (one for every table we want to update), append each micro-batch to it, and then merge into the actual table via a trigger on the Postgres end.
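
For reference, here is a minimal sketch of what the option 1 path could look like if the single-record writes were replaced with batched upserts per partition. This is only illustrative: the connection details, `stream_df`, the table/column names, and the conflict key are placeholders, not our actual schema.

```python
# Hypothetical sketch of option 1: batched upserts from the foreachBatch handler
# instead of one UPDATE per record. Connection details, table/column names, and
# the conflict key are placeholders.
import psycopg2
from psycopg2.extras import execute_values

def upsert_partition(rows):
    # Materialize and sort by key so concurrent partitions take row locks in a
    # consistent order, which helps avoid deadlocks.
    data = sorted((r["id"], r["value"], r["updated_at"]) for r in rows)
    if not data:
        return
    conn = psycopg2.connect(host="<aurora-endpoint>", dbname="<db>",
                            user="<user>", password="<password>")
    try:
        with conn, conn.cursor() as cur:
            execute_values(
                cur,
                """
                INSERT INTO target_table (id, value, updated_at)
                VALUES %s
                ON CONFLICT (id) DO UPDATE
                SET value = EXCLUDED.value,
                    updated_at = EXCLUDED.updated_at
                """,
                data,
                page_size=1000,  # many rows per round trip instead of batchsize=1
            )
    finally:
        conn.close()

def write_to_postgres(batch_df, batch_id):
    # Each partition opens its own connection and writes in parallel on the executors.
    batch_df.foreachPartition(upsert_partition)

query = (stream_df.writeStream
         .foreachBatch(write_to_postgres)
         .option("checkpointLocation", "/path/to/checkpoint")
         .start())
```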

Option 2 has been working well for us so far (a simplified sketch of its write path is below), but I want to reach out to others and to the experts on this forum to understand whether there is a better approach, or any suggestions for optimizing ours to achieve real-time streaming frequency.
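
For context, a simplified sketch of the option 2 write path on the Spark side. The JDBC URL, staging table name, writer options, and the merge SQL shown in the comment are placeholders; in our setup the merge itself is fired by a trigger on the Postgres side.

```python
# Hypothetical sketch of option 2: append each micro-batch to a Postgres staging
# table with Spark's JDBC writer; the Postgres side then merges the staged rows
# into the main table. Names and option values are placeholders.

def append_to_staging(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://<aurora-endpoint>:5432/<db>")
        .option("dbtable", "staging.target_table_updates")
        .option("user", "<user>")
        .option("password", "<password>")
        .option("batchsize", 10000)      # rows per JDBC batch insert
        .option("numPartitions", 8)      # parallel connections into Postgres
        .mode("append")
        .save())
    # The trigger/merge on the Postgres side then does roughly:
    #   INSERT INTO target_table (id, value, updated_at)
    #   SELECT id, value, updated_at FROM staging.target_table_updates
    #   ON CONFLICT (id) DO UPDATE
    #     SET value = EXCLUDED.value, updated_at = EXCLUDED.updated_at;

query = (stream_df.writeStream
         .foreachBatch(append_to_staging)
         .option("checkpointLocation", "/path/to/checkpoint")
         .start())
```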

Any information is highly appreciated.

Thanks in advance for your response.

Sweta
