Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How large should a dataset be so that it’s worth using Spark?

Anonymous
Not applicable
 
3 REPLIES

Ryan_Chynoweth
Esteemed Contributor

As a general best practice, Spark is useful when it becomes difficult to process data on a single machine. For example, Python users love pandas, but once DataFrames approach the 1-10 million row mark, processing on a single machine becomes difficult.

A great aspect of Spark on Databricks is that you use only the compute you need. So if you are working with a dataset that is too big for a single machine but still relatively small, you can spin up a cluster with just 1-2 workers.
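
A minimal sketch of the pandas-to-Spark transition described above, assuming a hypothetical CSV of events (the file path and column names are made up):

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Fine for a few million rows on one machine:
pdf = pd.read_csv("/dbfs/data/events.csv")
pandas_daily = pdf.groupby("event_date")["amount"].sum()

# The same aggregation in Spark, which keeps working past
# single-machine memory because the work is distributed:
spark = SparkSession.builder.getOrCreate()
sdf = spark.read.csv("/data/events.csv", header=True, inferSchema=True)
spark_daily = sdf.groupBy("event_date").agg(F.sum("amount").alias("total"))
spark_daily.show()
```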

sean_owen
Databricks Employee

There's not one answer, but generally, you certainly need Spark when you can't fit the data in memory on one machine, as non-distributed implementations usually require. A terabyte or more is hard to fit in memory, and sometimes the threshold is even lower.

But more generally, Spark is for when you want the workload to complete faster by running on multiple machines. You may be able to process 100GB in 10 hours on one machine, but maybe you'd prefer to throw 100 machines at it instead and finish in 6 minutes for about the same cost. That's where Spark comes in.
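
The cost claim is just machine-hours arithmetic: the total compute stays the same, only the wall-clock time changes. A quick check of the figures in the reply:

```python
# 1 machine for 10 hours vs. 100 machines for 6 minutes:
single_machine_hours = 1 * 10            # 10 machine-hours
cluster_machine_hours = 100 * (6 / 60)   # also 10 machine-hours
assert single_machine_hours == cluster_machine_hours
```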

User16857281974
Contributor

@Ryan Chynoweth and @Sean Owen are both right, but I have a different perspective on this.

Quick side note: you can also configure your cluster to run with only a driver, reducing the cost to that of the cheapest single VM available. In the cluster settings, set the Cluster Mode to Single Node.
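
For those scripting cluster creation instead of using the UI, here is a rough sketch of a single-node payload for the Databricks Clusters API; the node type and runtime version are placeholders, and the exact keys should be checked against your workspace's API docs:

```python
# Hypothetical payload for POST /api/2.0/clusters/create.
# node_type_id and spark_version are placeholders.
single_node_cluster = {
    "cluster_name": "single-node-demo",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,  # driver only, no workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```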

To your specific question: I would assert that it's rather subjective (as others have said). But Databricks Academy regularly uses single-node clusters and small datasets for demonstration and education purposes. Clearly, our use case is rather specific.

Personally, I started using Databricks (not specifically Spark) 3-4 years ago when I was working at a small telephone company. The datasets were laughably small, but the approachability of Databricks made it a no-brainer to process our jobs with Spark.

Even more personally, and for the same reason that Databricks is so approachable, I use it regularly to analyze my spending (downloading transactions from my bank) and to analyze and process my emails (trying to figure out who spams me the most and what filters I could write to declutter my inbox).
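
For flavor, a minimal sketch of that kind of hobby analysis, assuming a made-up CSV export of bank transactions with date, merchant, and amount columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Columns and path are hypothetical; most banks export plain CSVs.
txns = spark.read.csv("/data/transactions.csv", header=True, inferSchema=True)

# Monthly spend per merchant -- tiny data, but the notebook
# workflow makes it pleasant anyway.
(txns
 .withColumn("month", F.date_format("date", "yyyy-MM"))
 .groupBy("month", "merchant")
 .agg(F.sum("amount").alias("spent"))
 .orderBy("month", F.desc("spent"))
 .show())
```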

At the end of the day, VM prices are so cheap, single-node clusters are an option, and Databricks is so approachable that I would assert there may be no minimum.
