Re: How large should a dataset be so that it’s wor...

sean_owen · 06-17-2021

There's no single answer, but generally you need Spark when the data can't fit in memory on one machine, which non-distributed implementations usually require. A terabyte or more is hard to fit in memory, and sometimes the practical limit is lower.

More generally, Spark is worth it when you want the workload to complete faster by running on multiple machines. You may be able to process 100GB in 10 hours on one machine, but maybe you'd prefer to throw 100 machines at it instead and finish in 6 minutes for about the same cost. That's where Spark comes in.
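To make the arithmetic behind that trade-off concrete, here is a rough sketch in Python. It assumes ideal linear scaling (no shuffle, scheduling, or startup overhead) and that cost is proportional to machine-hours; real jobs will do somewhat worse. The function name `scale_out` is just an illustration, not part of Spark.

```python
def scale_out(single_machine_hours: float, machines: int) -> tuple[float, float]:
    """Return (wall_clock_minutes, machine_hours) for a job spread over `machines`.

    Assumption: perfect linear speedup, which real clusters only approximate.
    """
    wall_clock_hours = single_machine_hours / machines
    # Cost proxy: every machine is billed for the whole wall-clock duration.
    machine_hours = wall_clock_hours * machines
    return wall_clock_hours * 60, machine_hours


# The example from the post: a 10-hour job on 1 machine vs. 100 machines.
for n in (1, 100):
    minutes, cost = scale_out(10, n)
    print(f"{n:>3} machine(s): {minutes:.0f} minutes, {cost:.0f} machine-hours")
```

Under these assumptions the machine-hours stay at 10 either way, which is why the cost is "about the same" while the wait drops from 600 minutes to 6.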