09-08-2022 04:09 AM
1 million records is really nothing. Spark can handle that without any problem, even on a single-worker cluster.
The important thing is NOT to loop over every record (which I suspect you are doing). When you create a dataframe from your data, you can treat a column as a vector or a set, which can be manipulated in one go.
For example:
from pyspark.sql.functions import concat
dataframe = dataframe.withColumn("newcol", concat("col1", "col2"))
This adds a column newcol to every row without you having to worry about looping, state, etc.