-werners-
Esteemed Contributor III

1 million records is really nothing. Spark can handle that without any problem, even on a single-worker cluster.

The important thing is NOT to use a loop to iterate over every record (which I suspect you do). When you create a dataframe from your data, you can treat a column as a vector or a set that can be manipulated in one go.

For example (PySpark):

from pyspark.sql.functions import concat
dataframe = dataframe.withColumn("newcol", concat("col1", "col2"))

This adds a column newcol without you having to worry about looping, state, etc.