Re: How large should a dataset be so that it’s wor...

Ryan_Chynoweth · 06-11-2021

As a general best practice, Spark is useful when it becomes difficult to process data on a single machine. For example, Python users love pandas, but once DataFrames approach the 1-10 million row mark, processing on a single machine becomes difficult.
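
Row count is only a proxy here; what usually matters is whether the data, plus the working copies made while processing it, fits in memory. A rough way to sanity-check that is sketched below. The function names, the bytes-per-row figures, and the 5x working-memory multiplier are all assumptions chosen for illustration, not measured values or anything provided by pandas or Spark.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def estimated_bytes(n_rows: int, bytes_per_row: int) -> int:
    """Rough in-memory size of a table: rows times average bytes per row."""
    return n_rows * bytes_per_row


def fits_on_single_machine(
    n_rows: int,
    bytes_per_row: int,
    ram_bytes: int,
    working_factor: float = 5.0,  # assumed headroom for intermediate copies
) -> bool:
    """Return True if the estimated working set fits in the given RAM.

    working_factor is a hypothetical rule-of-thumb multiplier; tune it to
    your own workload (joins and group-bys tend to need more headroom).
    """
    needed = estimated_bytes(n_rows, bytes_per_row) * working_factor
    return needed <= ram_bytes


# 1 million narrow rows on a 16 GiB machine: comfortably single-machine.
print(fits_on_single_machine(1_000_000, 200, 16 * GIB))

# 10 million wide rows on the same machine: a candidate for a small cluster.
print(fits_on_single_machine(10_000_000, 2_000, 16 * GIB))
```

If you already have a sample loaded in pandas, you could get a more realistic bytes-per-row figure from the sample's actual memory usage instead of guessing.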

A great aspect of Spark on Databricks is that you use only the compute you need. So if you are working with a smaller dataset that is still too big for a single machine, you can spin up a cluster with 1-2 workers.