Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Choosing the optimal cluster size/specs.

sage5616
Valued Contributor

Hello everyone,

I am trying to determine the appropriate cluster specifications/sizing for my workload:

Run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This task runs every 5 minutes and needs to complete within a minute.

The size of the batch of input Avro files ranges from 100 KB to 100 MB per run.

It is important that the cluster supports creating and querying persistent views. I do not know yet how many processes will query the views, but I estimate about 1 to 10 concurrent queries with simple SELECT statements filtering data.

Big thank you 🙂

I have already researched Databricks manuals and guides and am looking for an opinion/recommendation from the community.

1 ACCEPTED SOLUTION


Anonymous
Not applicable

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.
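A single-node jobs cluster along these lines could be expressed as a `new_cluster` spec for the Jobs API. This is a hypothetical sketch: the node type and runtime version are assumptions, not values from the thread.

```python
# Hypothetical "new_cluster" block for a Jobs API job definition.
# node_type_id and spark_version are placeholders; pick what your
# workspace and cloud provider actually offer.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "m5.large",  # a small node; a 100 MB batch fits easily
    "num_workers": 0,            # single node: the driver does all the work
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```

The `spark_conf` and `custom_tags` entries are the documented way to mark a cluster as single-node; `num_workers: 0` means no executors beyond the driver.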


3 REPLIES

Debayan
Databricks Employee

Thanks for your question!

Please check https://docs.databricks.com/clusters/cluster-config-best-practices.html and let me know if it helps.

sage5616
Valued Contributor

Thanks, but please pay attention to what I wrote in my question above: "I have already researched Databricks manuals and guides and am looking for an opinion/recommendation from the community."

Simply pointing to a manual is not helpful. @Debayan Mukherjee

