Hi everyone,

I'm currently trying to implement Spark Structured Streaming with PySpark. I would like to merge multiple rows into a single row containing an array, then sink the result to a downstream message queue for another service to consume. A minimal example:
* Before

| col1 |
|------|
| {"a": 1, "b": 2} |
| {"a": 2, "b": 3} |

* After

| col1 |
|------|
| [{"a": 1, "b": 2}, {"a": 2, "b": 3}] |
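Outside of Spark, the transformation I want is essentially "fold every row's `col1` payload into one array-valued row". A plain-Python sketch of just the semantics (not my actual job):

```python
# Plain-Python illustration of the merge, with no Spark involved:
# every row's col1 payload is collected into a single array column.
rows = [
    {"col1": {"a": 1, "b": 2}},
    {"col1": {"a": 2, "b": 3}},
]

merged = {"col1": [r["col1"] for r in rows]}
# merged == {"col1": [{"a": 1, "b": 2}, {"a": 2, "b": 3}]}
```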
From my research, `collect_list()` can be used for this. But my understanding is that this function gathers the data in one place, so it carries some risk of a driver-node OOM. In fact, when I watch our Structured Streaming application's job metrics in Databricks, driver memory usage keeps increasing and OOM errors do occur.
Given this scenario, is there a better solution that achieves the merge while avoiding a driver-node OOM? If you have any ideas, please share them. I would appreciate it.