Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Optimized option to write updates to Aurora PostgreSQL from Databricks/Spark

Sweta
New Contributor II

Hello All,

We want to update our Postgres tables from our Spark Structured Streaming workflow on Databricks. We are using the foreachBatch utility to write to this sink. I want to understand an optimized way to do this at near-real-time latency while avoiding deadlocks and improving concurrency and parallelism. Right now we are considering two options:

1. JDBC connector/psycopg2: we run updates directly on the main table in Postgres, but we are not utilizing parallelism appropriately and are inserting one record at a time (batch size = 1) for fear of losing or delaying the other records in a batch if one record fails. This also increases latency, which we do not want.

2. Append to temp table: in this approach we create a temp table in Postgres (for every table we want to update) and then merge into the actual table via a trigger on the Postgres side.

Option 2 has been working well for us so far, but I want to reach out to others and experts on the forum to understand whether there is a better approach, or any suggestions to optimize ours and achieve real-time streaming frequency.
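For reference, the Spark side of option 2 looks roughly like the sketch below (table names, connection details, and paths are all placeholders):

```python
# Sketch of option 2 (placeholders throughout): append each micro-batch to a
# staging table; an AFTER INSERT trigger on the Postgres side merges the
# staged rows into the main table.

def append_to_staging(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://<aurora-endpoint>:5432/mydb")
        .option("dbtable", "events_staging")  # trigger on this table runs the merge
        .option("user", "writer")
        .option("password", "...")
        .option("batchsize", 10000)  # plain appends are safe to batch aggressively
        .mode("append")
        .save())

def start_stream(spark):
    return (spark.readStream
        .table("source_events")
        .writeStream
        .foreachBatch(append_to_staging)
        .option("checkpointLocation", "/checkpoints/events_staging")
        .start())
```

Because the Spark side only appends, we can use a large JDBC batch size without worrying about one failed update blocking the rest of the batch.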

Any information is highly appreciated.

Thanks in advance for your response.

Sweta

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Sweta, Your question about optimizing updates to PostgreSQL tables from a Spark structured streaming workflow is quite relevant, and I’m glad you’re exploring different approaches.

  • Running updates directly on the main table in PostgreSQL using the JDBC connector is a common approach. However, as you’ve noted, it can be challenging to achieve parallelism and low latency.
  • The issue with inserting one record at a time (batch size = 1) is that it can lead to increased latency due to network round-trips and database overhead.
  • To improve this approach, consider increasing the batch size to a reasonable value (greater than 1) to reduce the impact of individual record failures. You can also explore connection pooling to enhance parallelism.
  • Your second option involves creating a temporary table in PostgreSQL for each table you want to update. Then, you merge the data from the temp table into the actual table using a trigger.
  • This approach can be effective because it allows you to batch-process updates and perform the merge operation efficiently.
  • However, it does introduce additional complexity with managing temp tables and triggers.
  • Make sure to handle any potential race conditions or conflicts during the merge process.
  • When using stateful streaming queries, state updates can bottleneck performance. Consider enabling asynchronous checkpointing so that state updates happen asynchronously with micro-batch processing. Keep in mind that this tradeoff may result in slower restarts.
  • Consider periodically restarting your Spark streaming job to clean up old files and manage input/output retention periods. This can help maintain performance and prevent resource leaks.
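On the asynchronous state checkpointing point, on Databricks this is enabled with a Spark conf; to the best of my knowledge it also requires the RocksDB state store, so please verify both keys against the documentation for your runtime version:

```python
# Hypothetical Databricks notebook fragment; `spark` is the active session.
spark.conf.set(
    "spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled", "true")
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")
```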
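To make the batch-size and conflict-handling suggestions concrete, here is a minimal sketch of a foreachBatch writer that upserts each micro-batch in chunks with psycopg2's execute_values. Table names, key columns, and connection details are placeholders, and the ON CONFLICT clause stands in for your trigger-based merge:

```python
def build_upsert_sql(table, cols, key_cols):
    """Build an INSERT ... ON CONFLICT DO UPDATE statement for execute_values."""
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols if c not in key_cols)
    return (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES %s "
        f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE SET {updates}"
    )

def upsert_batch(batch_df, batch_id):
    # psycopg2 is imported lazily so the sketch can be defined without the driver.
    import psycopg2
    from psycopg2.extras import execute_values

    cols = batch_df.columns
    sql = build_upsert_sql("events", cols, key_cols=["id"])
    rows = [tuple(r) for r in batch_df.collect()]  # fine for small batches;
                                                   # process per-partition for large ones
    conn = psycopg2.connect(host="<aurora-endpoint>", dbname="mydb",
                            user="writer", password="...")
    try:
        with conn, conn.cursor() as cur:
            # page_size > 1 sends many rows per round-trip instead of one at a time
            execute_values(cur, sql, rows, page_size=1000)
    finally:
        conn.close()
```

With `page_size` well above 1, each round-trip carries many rows, which addresses the latency of batch size = 1; the `with conn` block keeps each micro-batch in a single transaction, so a failure rolls back the whole chunk rather than stranding partial updates.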
 