cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How large should a dataset be so that it’s worth using Spark?

Anonymous
Not applicable
 
3 REPLIES 3

Ryan_Chynoweth
Honored Contributor III

As a general best practice Spark is useful when it becomes difficult to process data on a single machine. For example, Python users love using pandas but when DataFrames start to approach the 1-10 million row mark processing on a single machine becomes difficult.

A great aspect about Spark on Databricks is that you can only use the compute that you need. So if you are working with a smaller dataset that is too big for a single machine you can spin up a cluster with 1-2 workers.

sean_owen
Honored Contributor II
Honored Contributor II

There's not one answer, but generally, you certainly need Spark when you can't fit the data on one machine in memory, as is usually required for non-distributed implementations. A terabyte or more is hard to put into memory, sometimes less.

But more generally when you want the workload to complete faster by running on multiple machines. You may be able to process 100GB in 10 hours, but, maybe you'd prefer to throw 100 machines at it instead and finish in 6 minutes for about the same cost. That's where Spark comes in.

User16857281974
Contributor

@Ryan Chynoweth​ and @Sean Owen​  are both right, but I have a different perspective on this.

Quick side note: you can also configure your cluster to execute with only a driver, and thus reducing the cost to the cheapest single VM available. In the cluster settings, set the Cluster Mode to Single Node.

To your specific question, I would assert that it's rather subjective (as others are stating). But Databricks Academy regularly uses Single Node machines and small datasets for demonstration and education purposes. Clearly, our use case is rather specific.

Personally, I started using Databricks (not specifically Spark) 3-4 years ago when I was working at a small telephone company. The datasets were laughably small, but the approachability of Databricks made it a no-brainer to processes our jobs with Spark.

Even more personally, for the same reason that Databricks is so approachable, I use it regularly to analyze my spending (downloading transactions from my bank), analyzing and processing my emails (trying to figure out who spams me the most and what filters could I write to declutter my inbox)

At the end of the day, VM prices are so cheap, with the ability to run in a single node, and given that Databricks is so approachable, I would assert there may be no minimum.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!