Re: Optimizing .collect() Usage in Spark

cgrant · ‎02-27-2025

Data from collect() will automatically be garbage collected after it is out of scope. However, it's not recommended for larger data. Instead, you can

Write to a staging table in Postgres via Spark's JDBC connector, then issue a command via JDBC that performs the delete between the staging table and your target. On the Spark side, this operation is distributed among the worker nodes with much less memory usage on the driver.
If you must use the Spark driver to perform this, try using toLocalIterator() instead of collect(), which avoids memory problems by collecting the data in pieces