Data Engineering

How to merge multiple rows into a single row with an array, without causing OOM?

MarsSu
New Contributor II

Hi, Everyone.

Currently I'm trying to implement Spark Structured Streaming with PySpark. I would like to merge multiple rows into a single row containing an array, then sink the result to a downstream message queue for another service to consume. For example:

* Before

| col1 |
| --- |
| {"a": 1, "b": 2} |
| {"a": 2, "b": 3} |

* After

| col1 |
| --- |
| [{"a": 1, "b": 2}, {"a": 2, "b": 3}] |

From what I've surveyed, I can call `collect_list()` to do this. But my understanding is that this function gathers the data together, so it carries some risk of driver-node OOM. Indeed, watching our Structured Streaming application in the Databricks job metrics, driver memory usage kept increasing until OOM errors occurred.

Given this scenario, is there a better solution that avoids driver-node OOM? If you have any ideas, please share them. I would appreciate it.
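For clarity, the transformation I want behaves like a per-key `collect_list()` aggregation. Here is a minimal pure-Python sketch of just the semantics (the key name `group_key` and the helper are hypothetical; the real job would use something like PySpark's `groupBy(...).agg(collect_list(struct(...)))`):

```python
import json
from collections import defaultdict

def merge_rows_to_array(rows, key):
    """Group rows by `key` and collect the remaining fields of each row
    into one array per key -- mimicking what a per-group collect_list does."""
    grouped = defaultdict(list)
    for row in rows:
        # Everything except the grouping key becomes one element of the array.
        payload = {c: v for c, v in row.items() if c != key}
        grouped[row[key]].append(payload)
    # One output row per key, with all payloads merged into a single array.
    return [{key: k, "col1": payloads} for k, payloads in grouped.items()]

rows = [
    {"group_key": "x", "a": 1, "b": 2},
    {"group_key": "x", "a": 2, "b": 3},
]
merged = merge_rows_to_array(rows, "group_key")
print(json.dumps(merged))
# → [{"group_key": "x", "col1": [{"a": 1, "b": 2}, {"a": 2, "b": 3}]}]
```

In Spark this would be a stateful streaming aggregation, so memory pressure depends on how much per-group state accumulates, not on this Python loop.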

3 REPLIES

Anonymous
Not applicable

Hi @Mars Su,

Great to meet you, and thanks for your question!

Let's see if your peers in the community have an answer to your question. Thanks.

MarsSu
New Contributor II

Dear @Vidula Khanna,

Thanks for your help. I hope we can find a solution to this. Thanks.

917074
New Contributor II

Is there any solution to this? @MarsSu, were you able to solve it? Kindly shed some light on this if you resolved it.