Data Engineering

How to merge multiple rows into a single row with an array, without causing OOM?

MarsSu
New Contributor II

Hi, Everyone.

Currently I'm trying to implement Spark Structured Streaming with PySpark. I would like to merge multiple rows into a single row containing an array, then sink the result to a downstream message queue for another service to consume. For example:

* Before

| col1 |
| --- |
| {"a": 1, "b": 2} |
| {"a": 2, "b": 3} |

* After

| col1 |
| --- |
| [{"a": 1, "b": 2}, {"a": 2, "b": 3}] |

From what I've surveyed, I can call `collect_list()` to do this. But my understanding is that this function gathers the data together, so it carries some risk of driver-node OOM. Indeed, watching our Structured Streaming application in the Databricks job metrics, driver memory usage kept increasing until OOM errors occurred.

Given this scenario, is there a better solution that avoids driver-node OOM? If you have any ideas, please share them. I would appreciate it.
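For clarity, the transformation I want behaves like a per-key `collect_list()` aggregation. Here is a minimal pure-Python sketch of just the semantics (the key name `group_key` and the helper are hypothetical; the real job would use something like PySpark's `groupBy(...).agg(collect_list(struct(...)))`):

```python
import json
from collections import defaultdict

def merge_rows_to_array(rows, key):
    """Group rows by `key` and collect the remaining fields of each row
    into one array per key -- mimicking what a per-group collect_list does."""
    grouped = defaultdict(list)
    for row in rows:
        # Everything except the grouping key becomes one element of the array.
        payload = {c: v for c, v in row.items() if c != key}
        grouped[row[key]].append(payload)
    # One output row per key, with all payloads merged into a single array.
    return [{key: k, "col1": payloads} for k, payloads in grouped.items()]

rows = [
    {"group_key": "x", "a": 1, "b": 2},
    {"group_key": "x", "a": 2, "b": 3},
]
merged = merge_rows_to_array(rows, "group_key")
print(json.dumps(merged))
# → [{"group_key": "x", "col1": [{"a": 1, "b": 2}, {"a": 2, "b": 3}]}]
```

In Spark this would be a stateful streaming aggregation, so memory pressure depends on how much per-group state accumulates, not on this Python loop.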

3 REPLIES

Anonymous
Not applicable

Hi @Mars Su,

Great to meet you, and thanks for your question!

Let's see if your peers in the community have an answer to your question. Thanks.

MarsSu
New Contributor II

Dear @Vidula Khanna,

Thanks for your help. I hope we can find a solution to this. Thanks.

917074
New Contributor II

Is there any solution to this? @MarsSu, were you able to solve it? Kindly shed some light on this if you resolved it.