How large should a dataset be so that it's worth using Spark?

Anonymous
Not applicable

3 REPLIES

Ryan_Chynoweth
Honored Contributor III

As a general best practice, Spark is useful when it becomes difficult to process data on a single machine. For example, Python users love pandas, but once DataFrames approach the 1-10 million row mark, processing on a single machine becomes difficult.
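To make that concrete, here is a minimal sketch of the same group-by aggregation in pandas and in PySpark. The file path and column names are made up for illustration, and `spark` and `display` are the handles Databricks notebooks provide automatically.

```python
# Hypothetical example: the same aggregation in pandas vs. PySpark.
# Path and column names are placeholders.
import pandas as pd
from pyspark.sql import functions as F

# pandas: the whole file is loaded into one machine's memory
pdf = pd.read_csv("/dbfs/data/transactions.csv")
summary_pd = pdf.groupby("customer_id")["amount"].sum()

# Spark: the read and the aggregation are distributed across the cluster,
# so the dataset no longer has to fit on a single machine
sdf = spark.read.csv("dbfs:/data/transactions.csv", header=True, inferSchema=True)
summary_spark = sdf.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
display(summary_spark)
```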

A great aspect of Spark on Databricks is that you use only the compute you need. So if you are working with a smaller dataset that is still too big for a single machine, you can spin up a cluster with just 1-2 workers.
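If you create clusters programmatically rather than through the UI, a small cluster like that is just a small payload to the Clusters REST API. This is only a sketch: the workspace URL, token, node type, and runtime version below are placeholders you would swap for your own values.

```python
# Sketch of a small-cluster spec sent to the Databricks Clusters API
# (POST /api/2.0/clusters/create). All values are illustrative placeholders.
import requests

small_cluster = {
    "cluster_name": "small-etl-cluster",
    "spark_version": "13.3.x-scala2.12",   # example Databricks Runtime
    "node_type_id": "i3.xlarge",           # example node type
    "num_workers": 2,                      # just enough to go beyond one machine
    "autotermination_minutes": 30,         # shut down when idle to keep cost low
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=small_cluster,
)
print(resp.json())
```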

sean_owen
Honored Contributor II

There's not one answer, but generally, you certainly need Spark when you can't fit the data in memory on one machine, as non-distributed implementations usually require. A terabyte or more is hard to fit in memory; sometimes even less is.

More generally, Spark is useful when you want the workload to complete faster by running on multiple machines. You may be able to process 100 GB in 10 hours on one machine, but maybe you'd prefer to throw 100 machines at it instead and finish in about 6 minutes for roughly the same cost. That's where Spark comes in.
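The "same cost" intuition is just that the total machine-hours stay roughly constant; you simply buy them in parallel. A quick back-of-the-envelope check, with a made-up hourly rate:

```python
# Back-of-the-envelope check: total machine-hours are the same either way.
hourly_rate = 0.50  # $/machine-hour, illustrative only

single_machine_cost = 1 * 10 * hourly_rate      # 1 machine x 10 hours  = 10 machine-hours
cluster_cost = 100 * (6 / 60) * hourly_rate     # 100 machines x 6 min = 10 machine-hours

print(single_machine_cost, cluster_cost)        # 5.0 vs 5.0 -> same spend, ~100x faster
```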

User16857281974
Contributor

@Ryan Chynoweth and @Sean Owen are both right, but I have a different perspective on this.

Quick side note: you can also configure your cluster to execute with only a driver, thus reducing the cost to the cheapest single VM available. In the cluster settings, set the Cluster Mode to Single Node.
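If you define clusters via the API instead of the UI, the single-node equivalent is a spec with zero workers and Spark running locally on the driver. This is a sketch assuming the documented single-node settings; the name, node type, and runtime version are placeholders.

```python
# Sketch of a single-node cluster spec (API equivalent of Cluster Mode = Single Node):
# zero workers, Spark running in local mode on the driver.
single_node_cluster = {
    "cluster_name": "single-node-sandbox",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,
    "spark_conf": {
        "spark.master": "local[*]",
        "spark.databricks.cluster.profile": "singleNode",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```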

To your specific question, I would assert that it's rather subjective (as others have stated). But Databricks Academy regularly uses single-node clusters and small datasets for demonstration and education purposes. Clearly, our use case is rather specific.

Personally, I started using Databricks (not specifically Spark) 3-4 years ago when I was working at a small telephone company. The datasets were laughably small, but the approachability of Databricks made it a no-brainer to process our jobs with Spark.

Even more personally, for the same reason that Databricks is so approachable, I use it regularly to analyze my spending (downloading transactions from my bank) and to analyze and process my emails (trying to figure out who spams me the most and what filters I could write to declutter my inbox).
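The "who spams me the most" question boils down to a simple aggregation once the mailbox export has been parsed into a table. A sketch, assuming a hypothetical table with `sender` and `received_at` columns (names are made up; `spark` and `display` are the Databricks notebook built-ins):

```python
# Sketch: rank senders by message count from a parsed mailbox export.
# Table and column names are hypothetical.
from pyspark.sql import functions as F

emails = spark.table("personal.mailbox")

top_senders = (
    emails
    .groupBy("sender")
    .agg(
        F.count(F.lit(1)).alias("messages"),
        F.max("received_at").alias("last_seen"),
    )
    .orderBy(F.desc("messages"))
    .limit(20)
)

display(top_senders)
```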

At the end of the day, VM prices are cheap, you can run on a single node, and Databricks is so approachable that I would assert there may be no minimum.
