Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Optimized option to write updates to Aurora PostgreSQL from Databricks/Spark

Sweta
New Contributor II

Hello All,

We want to update our Postgres tables from our Spark Structured Streaming workflow on Databricks. We are using the foreachBatch utility to write to this sink. I want to understand an optimized way to do this at near-real-time latency while avoiding deadlocks and improving concurrency and parallelism. Right now we are considering two options:

1. JDBC connector/psycopg2: we run updates directly on the main table in Postgres, but we are not utilizing parallelism appropriately and are inserting one record at a time (batch size = 1) for fear of losing or delaying the other records in a batch if one record fails. This also increases latency, which we do not want.

2. Append to temp table: in this approach we create a temp table in Postgres (for every table we want to update) and then merge into the actual table via a trigger on the Postgres side.

Option 2 has been working well for us so far, but I want to reach out to others and experts on the forum to understand whether there is a better approach, or any suggestions to optimize ours and achieve real-time streaming frequency.
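For reference, the Spark side of option 2 looks roughly like the sketch below (table names, connection details, and paths are all placeholders):

```python
# Sketch of option 2 (placeholders throughout): append each micro-batch to a
# staging table; an AFTER INSERT trigger on the Postgres side merges the
# staged rows into the main table.

def append_to_staging(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://<aurora-endpoint>:5432/mydb")
        .option("dbtable", "events_staging")  # trigger on this table runs the merge
        .option("user", "writer")
        .option("password", "...")
        .option("batchsize", 10000)  # plain appends are safe to batch aggressively
        .mode("append")
        .save())

def start_stream(spark):
    return (spark.readStream
        .table("source_events")
        .writeStream
        .foreachBatch(append_to_staging)
        .option("checkpointLocation", "/checkpoints/events_staging")
        .start())
```

Because the Spark side only appends, we can use a large JDBC batch size without worrying about one failed update blocking the rest of the batch.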

Any information is highly appreciated.

Thanks in advance for your response.

Sweta

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Sweta, Your question about optimizing updates to PostgreSQL tables from a Spark structured streaming workflow is quite relevant, and I’m glad you’re exploring different approaches.

  • Running updates directly on the main table in PostgreSQL using the JDBC connector is a common approach. However, as you’ve noted, it can be challenging to achieve parallelism and low latency.
  • The issue with inserting one record at a time (batch size = 1) is that it can lead to increased latency due to network round-trips and database overhead.
  • To improve this approach, consider increasing the batch size to a reasonable value (greater than 1) to reduce the impact of individual record failures. You can also explore connection pooling to enhance parallelism.
  • Your second option involves creating a temporary table in PostgreSQL for each table you want to update. Then, you merge the data from the temp table into the actual table using a trigger.
  • This approach can be effective because it allows you to batch-process updates and perform the merge operation efficiently.
  • However, it does introduce additional complexity with managing temp tables and triggers.
  • Make sure to handle any potential race conditions or conflicts during the merge process.
  • When using stateful streaming queries, state updates can bottleneck performance. Consider enabling asynchronous checkpointing so that state updates happen asynchronously with micro-batch processing. Keep in mind that this tradeoff may result in slower restarts.
  • Consider periodically restarting your Spark streaming job to clean up old files and manage input/output retention periods. This can help maintain performance and prevent resource leaks.
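On the asynchronous state checkpointing point, on Databricks this is enabled with a Spark conf; to the best of my knowledge it also requires the RocksDB state store, so please verify both keys against the documentation for your runtime version:

```python
# Hypothetical Databricks notebook fragment; `spark` is the active session.
spark.conf.set(
    "spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled", "true")
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")
```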
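To make the batch-size and conflict-handling suggestions concrete, here is a minimal sketch of a foreachBatch writer that upserts each micro-batch in chunks with psycopg2's execute_values. Table names, key columns, and connection details are placeholders, and the ON CONFLICT clause stands in for your trigger-based merge:

```python
def build_upsert_sql(table, cols, key_cols):
    """Build an INSERT ... ON CONFLICT DO UPDATE statement for execute_values."""
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols if c not in key_cols)
    return (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES %s "
        f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE SET {updates}"
    )

def upsert_batch(batch_df, batch_id):
    # psycopg2 is imported lazily so the sketch can be defined without the driver.
    import psycopg2
    from psycopg2.extras import execute_values

    cols = batch_df.columns
    sql = build_upsert_sql("events", cols, key_cols=["id"])
    rows = [tuple(r) for r in batch_df.collect()]  # fine for small batches;
                                                   # process per-partition for large ones
    conn = psycopg2.connect(host="<aurora-endpoint>", dbname="mydb",
                            user="writer", password="...")
    try:
        with conn, conn.cursor() as cur:
            # page_size > 1 sends many rows per round-trip instead of one at a time
            execute_values(cur, sql, rows, page_size=1000)
    finally:
        conn.close()
```

With `page_size` well above 1, each round-trip carries many rows, which addresses the latency of batch size = 1; the `with conn` block keeps each micro-batch in a single transaction, so a failure rolls back the whole chunk rather than stranding partial updates.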
 