Re: Optimizing .collect() Usage in Spark

jeremy98 · ‎02-27-2025

Hi,

Thanks for the answer. For now I'm going to follow the second suggestion and use toLocalIterator(). I changed my code to use this call, for example:

delete_data = [tuple(row) for row in records_to_delete_df.toLocalIterator()]

But I was wondering: with this method Spark collects the data in pieces, right? So it doesn't pull everything to the driver at once, only one piece at a time. If the cluster stays active, could an OOM still arise in the future with this type of call? Does the driver free the memory after each piece?
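One thing I noticed while writing this: the list comprehension above still keeps every row in a single Python list on the driver, so even if toLocalIterator() fetches the data piece by piece, the final list is as large as a collect() result. A minimal sketch of what I think consuming the rows in bounded batches would look like is below. The helper name iter_batches, the batch size, and the delete_batch function mentioned in the comments are my own placeholders, not part of Spark.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_batches(rows: Iterable, batch_size: int) -> Iterator[List[Tuple]]:
    """Yield lists of at most batch_size tuples from any row iterator.

    Only one batch is referenced at a time, so earlier batches can be
    garbage collected once the caller is done with them.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        # islice pulls at most batch_size items without consuming the rest.
        batch = [tuple(row) for row in islice(it, batch_size)]
        if not batch:
            return
        yield batch


# Intended use with Spark (assumption: delete_batch is a hypothetical
# function that sends one batch of keys to the target system):
#
#   for batch in iter_batches(records_to_delete_df.toLocalIterator(), 10_000):
#       delete_batch(batch)

if __name__ == "__main__":
    # Stand-in for the Spark iterator so the sketch runs without a cluster.
    fake_rows = ((i, f"key-{i}") for i in range(5))
    for batch in iter_batches(fake_rows, 2):
        print(batch)
```

If this is right, the driver would only hold one batch (plus whatever partition Spark is currently streaming) instead of the whole result, but I would appreciate confirmation.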