Data Engineering

Choosing the optimal cluster size/specs.

sage5616
Valued Contributor

Hello everyone,

I am trying to determine the appropriate cluster specifications/sizing for my workload:

Run a PySpark task to transform a batch of input avro files to parquet files and create or re-create persistent views on these parquet files. This task runs every 5 mins and needs to complete within a minute.

The size of the batch of input avro files ranges from 100 KB to 100 MB per run.

It is important that the cluster supports creating and querying persistent views. I do not know yet how many processes will query the views, but I estimate about 1 to 10 concurrent queries, each a simple SELECT statement that filters data.
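To make the workload concrete, here is a minimal sketch of what one run of the task could look like. The function names, paths, and view name are hypothetical, and overwriting the parquet output on every run is an assumption (the post does not say whether each batch replaces or appends to earlier output). The "avro" reader is available on Databricks runtimes; on open-source Spark it requires the spark-avro package.

```python
def view_ddl(view_name: str, parquet_path: str) -> str:
    """Build a CREATE OR REPLACE VIEW statement over a parquet directory."""
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM parquet.`{parquet_path}`"
    )


def run_batch(spark, avro_path: str, parquet_path: str, view_name: str) -> None:
    """Convert one batch of avro files to parquet and (re)create the view.

    `spark` is an existing SparkSession (Databricks notebooks and jobs
    expose one as `spark`).
    """
    # Read the whole input batch; 100 KB to 100 MB fits easily on one node.
    df = spark.read.format("avro").load(avro_path)

    # Assumption: each run replaces the previous parquet output.
    df.write.mode("overwrite").parquet(parquet_path)

    # Re-create the persistent view so queries see the new files.
    spark.sql(view_ddl(view_name, parquet_path))
```

In a notebook this would be called as, for example, `run_batch(spark, "/mnt/in/batch", "/mnt/out/events", "events_v")`, where both paths and the view name are placeholders.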

Big thank you 🙂

I have already researched Databricks manuals and guides and looking for an opinion/recommendation from the community.

1 ACCEPTED SOLUTION

Anonymous
Not applicable

If the data is 100 MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to hold it all. You can automate this by scheduling the task as a job that runs on a jobs cluster.
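As a rough sketch of that suggestion, the snippet below builds a Jobs API 2.1 style payload for a job that runs every 5 minutes on a single node jobs cluster. The job name, task key, and notebook path are hypothetical, and `spark_version` and `node_type_id` are workspace-specific values that you would need to fill in yourself. The single node settings (`num_workers` of 0, the `singleNode` cluster profile, `local[*]` master, and the `ResourceClass` tag) follow the Databricks single node cluster documentation.

```python
import json


def single_node_job_spec(notebook_path: str, spark_version: str, node_type_id: str) -> dict:
    """Return a job definition that runs a notebook every 5 minutes
    on a fresh single node jobs cluster."""
    return {
        "name": "avro-to-parquet-every-5-min",  # hypothetical job name
        "schedule": {
            # Quartz cron: second 0 of every 5th minute.
            "quartz_cron_expression": "0 0/5 * * * ?",
            "timezone_id": "UTC",
        },
        # Do not start a new run while the previous one is still going.
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": "convert",  # hypothetical task key
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": {
                    "spark_version": spark_version,
                    "node_type_id": node_type_id,
                    "num_workers": 0,  # driver only, no workers
                    "spark_conf": {
                        "spark.databricks.cluster.profile": "singleNode",
                        "spark.master": "local[*]",
                    },
                    "custom_tags": {"ResourceClass": "SingleNode"},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute the ones valid in your workspace.
    spec = single_node_job_spec("/Jobs/avro_to_parquet", "13.3.x-scala2.12", "i3.xlarge")
    print(json.dumps(spec, indent=2))
```

One thing worth measuring before settling on this: a jobs cluster is created fresh for each run, so cluster start-up time may matter for a task that has to finish within a minute, and the views also need some running compute to be queried between runs.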


3 REPLIES

Debayan
Esteemed Contributor III

Thanks for your question!

Please check https://docs.databricks.com/clusters/cluster-config-best-practices.html and let me know if it helps.

sage5616
Valued Contributor

Thanks? But please pay attention to what I wrote in my question above: "I have already researched Databricks manuals and guides and looking for an opinion/recommendation from the community."

Simply pointing to a manual is not helpful. @Debayan Mukherjee

