How to merge multiple rows into a single row as an array without causing OOM?
06-22-2023 06:46 PM
Hi, Everyone.
I'm currently implementing a Spark Structured Streaming job with PySpark. I would like to merge multiple rows into a single row as an array, then sink the result to a downstream message queue for another service to consume. For example:
* Before

| col1 |
| --- |
| {"a": 1, "b": 2} |
| {"a": 2, "b": 3} |

* After

| col1 |
| --- |
| [{"a": 1, "b": 2}, {"a": 2, "b": 3}] |
From my research, I can call `collect_list()` to do this. But my understanding is that this approach collects data in a way that risks driver node OOM. Indeed, when I observe our Spark Structured Streaming application in the Databricks job metrics, driver memory usage keeps increasing and OOM errors occur.
Given this scenario, is there a better solution that achieves this while avoiding driver node OOM? If you have any ideas, please share them. I would appreciate it.
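For reference, here is a minimal sketch of the kind of streaming query I mean, using `collect_list()` as a grouped aggregation with a watermark. The Kafka topics, column names, and window/watermark values below are placeholders, not our real job:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("merge-rows-sketch").getOrCreate()

# Hypothetical streaming source; replace with the real input.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input-topic")
    .load()
    .selectExpr("CAST(value AS STRING) AS col1", "timestamp AS event_time")
)

# Merge rows into an array per time window; collect_list() runs as a
# grouped aggregation, and the watermark bounds the streaming state.
merged = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"))
    .agg(F.collect_list("col1").alias("col1"))
)

# Sink the merged arrays to a downstream message queue (Kafka assumed).
query = (
    merged
    .select(F.to_json(F.struct("col1")).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/checkpoints/merge-rows")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```

The intent of the watermark plus window is to keep the streaming aggregation state from growing without bound, but I'm not sure this is the best pattern for avoiding the memory issue described above.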

06-23-2023 12:16 AM
Hi @Mars Su
Great to meet you, and thanks for your question!
Let's see if your peers in the community have an answer to your question. Thanks.
06-23-2023 06:06 PM
Dear @Vidula Khanna ,
Thanks for your help. I hope we can find a solution to this, thanks.
01-19-2024 12:05 PM
Is there any solution to this? @MarsSu, were you able to solve it? Kindly shed some light on this if you resolved it.

