How large should a dataset be so that it's worth using Spark?

Anonymous
Not applicable

3 REPLIES

Ryan_Chynoweth
Honored Contributor III

As a general best practice, Spark is useful when it becomes difficult to process data on a single machine. For example, Python users love pandas, but once DataFrames approach the 1-10 million row mark, processing on a single machine becomes difficult.
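To make that concrete, here is a minimal sketch of the same group-by aggregation in pandas and in PySpark. The file path and column names are made up for illustration, and `spark` and `display` are the handles Databricks notebooks provide automatically.

```python
# Hypothetical example: the same aggregation in pandas vs. PySpark.
# Path and column names are placeholders.
import pandas as pd
from pyspark.sql import functions as F

# pandas: the whole file is loaded into one machine's memory
pdf = pd.read_csv("/dbfs/data/transactions.csv")
summary_pd = pdf.groupby("customer_id")["amount"].sum()

# Spark: the read and the aggregation are distributed across the cluster,
# so the dataset no longer has to fit on a single machine
sdf = spark.read.csv("dbfs:/data/transactions.csv", header=True, inferSchema=True)
summary_spark = sdf.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
display(summary_spark)
```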

A great aspect of Spark on Databricks is that you use only the compute you need. So if you are working with a smaller dataset that is still too big for a single machine, you can spin up a cluster with just 1-2 workers.
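If you create clusters programmatically rather than through the UI, a small cluster like that is just a small payload to the Clusters REST API. This is only a sketch: the workspace URL, token, node type, and runtime version below are placeholders you would swap for your own values.

```python
# Sketch of a small-cluster spec sent to the Databricks Clusters API
# (POST /api/2.0/clusters/create). All values are illustrative placeholders.
import requests

small_cluster = {
    "cluster_name": "small-etl-cluster",
    "spark_version": "13.3.x-scala2.12",   # example Databricks Runtime
    "node_type_id": "i3.xlarge",           # example node type
    "num_workers": 2,                      # just enough to go beyond one machine
    "autotermination_minutes": 30,         # shut down when idle to keep cost low
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=small_cluster,
)
print(resp.json())
```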

sean_owen
Honored Contributor II

There's not one answer, but generally, you certainly need Spark when you can't fit the data in memory on one machine, as non-distributed implementations usually require. A terabyte or more is hard to fit in memory; sometimes even less is.

More generally, Spark is useful when you want the workload to complete faster by running on multiple machines. You may be able to process 100 GB in 10 hours on one machine, but maybe you'd prefer to throw 100 machines at it instead and finish in about 6 minutes for roughly the same cost. That's where Spark comes in.
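The "same cost" intuition is just that the total machine-hours stay roughly constant; you simply buy them in parallel. A quick back-of-the-envelope check, with a made-up hourly rate:

```python
# Back-of-the-envelope check: total machine-hours are the same either way.
hourly_rate = 0.50  # $/machine-hour, illustrative only

single_machine_cost = 1 * 10 * hourly_rate      # 1 machine x 10 hours  = 10 machine-hours
cluster_cost = 100 * (6 / 60) * hourly_rate     # 100 machines x 6 min = 10 machine-hours

print(single_machine_cost, cluster_cost)        # 5.0 vs 5.0 -> same spend, ~100x faster
```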

User16857281974
Contributor

@Ryan Chynoweth and @Sean Owen are both right, but I have a different perspective on this.

Quick side note: you can also configure your cluster to execute with only a driver, thus reducing the cost to the cheapest single VM available. In the cluster settings, set the Cluster Mode to Single Node.
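If you define clusters via the API instead of the UI, the single-node equivalent is a spec with zero workers and Spark running locally on the driver. This is a sketch assuming the documented single-node settings; the name, node type, and runtime version are placeholders.

```python
# Sketch of a single-node cluster spec (API equivalent of Cluster Mode = Single Node):
# zero workers, Spark running in local mode on the driver.
single_node_cluster = {
    "cluster_name": "single-node-sandbox",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,
    "spark_conf": {
        "spark.master": "local[*]",
        "spark.databricks.cluster.profile": "singleNode",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```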

To your specific question, I would assert that it's rather subjective (as others have stated). But Databricks Academy regularly uses single-node clusters and small datasets for demonstration and education purposes. Clearly, our use case is rather specific.

Personally, I started using Databricks (not specifically Spark) 3-4 years ago when I was working at a small telephone company. The datasets were laughably small, but the approachability of Databricks made it a no-brainer to process our jobs with Spark.

Even more personally, for the same reason that Databricks is so approachable, I use it regularly to analyze my spending (downloading transactions from my bank) and to analyze and process my emails (trying to figure out who spams me the most and what filters I could write to declutter my inbox).
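The "who spams me the most" question boils down to a simple aggregation once the mailbox export has been parsed into a table. A sketch, assuming a hypothetical table with `sender` and `received_at` columns (names are made up; `spark` and `display` are the Databricks notebook built-ins):

```python
# Sketch: rank senders by message count from a parsed mailbox export.
# Table and column names are hypothetical.
from pyspark.sql import functions as F

emails = spark.table("personal.mailbox")

top_senders = (
    emails
    .groupBy("sender")
    .agg(
        F.count(F.lit(1)).alias("messages"),
        F.max("received_at").alias("last_seen"),
    )
    .orderBy(F.desc("messages"))
    .limit(20)
)

display(top_senders)
```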

At the end of the day, VM prices are cheap, you can run on a single node, and Databricks is so approachable that I would assert there may be no minimum.
