Hi everyone,

I'm currently trying to implement Spark Structured Streaming with PySpark. I would like to merge multiple rows into a single row containing an array, then sink the result to a downstream message queue for another service to consume. A minimal example:
* Before

| col1 |
|------|
| {"a": 1, "b": 2} |
| {"a": 2, "b": 3} |

* After

| col1 |
|------|
| [{"a": 1, "b": 2}, {"a": 2, "b": 3}] |
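Outside of Spark, the transformation I want is essentially "fold every row's `col1` payload into one array-valued row". A plain-Python sketch of just the semantics (not my actual job):

```python
# Plain-Python illustration of the merge, with no Spark involved:
# every row's col1 payload is collected into a single array column.
rows = [
    {"col1": {"a": 1, "b": 2}},
    {"col1": {"a": 2, "b": 3}},
]

merged = {"col1": [r["col1"] for r in rows]}
# merged == {"col1": [{"a": 1, "b": 2}, {"a": 2, "b": 3}]}
```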
From my research, `collect_list()` can be used for this. But my understanding is that this function gathers the data in one place, so it carries some risk of a driver-node OOM. In fact, when I watch our Structured Streaming application's job metrics in Databricks, driver memory usage keeps increasing and OOM errors do occur.
Given this scenario, is there a better solution that achieves the merge while avoiding a driver-node OOM? If you have any ideas, please share them. I would appreciate it.