08-03-2022 03:06 PM
Hello everyone,
I am trying to determine the appropriate cluster specifications/sizing for my workload:
Run a PySpark task to transform a batch of input Avro files to Parquet files, and create or re-create persistent views on these Parquet files. This task runs every 5 minutes and needs to complete within a minute.
The size of the batch of input Avro files ranges from 100 KB to 100 MB per run.
It is important that the cluster supports creating and querying persistent views. I do not know yet how many processes will query the views, but I estimate about 1-10 concurrent queries with simple SELECT statements filtering data.
I have already researched the Databricks manuals and guides and am looking for an opinion/recommendation from the community.
Big thank you 🙂
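To make the workload concrete, here is a minimal PySpark sketch of the transform-and-view step described above; the paths and the view name are hypothetical placeholders, not the actual job code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

# Read the current batch of Avro files (path is an assumption).
df = spark.read.format("avro").load("/mnt/landing/events/")

# Rewrite the batch as Parquet.
df.write.mode("overwrite").parquet("/mnt/curated/events/")

# Create or re-create a persistent view in the metastore so that
# other processes can query the Parquet files by name.
spark.sql("""
    CREATE OR REPLACE VIEW events_v AS
    SELECT * FROM parquet.`/mnt/curated/events/`
""")

# Example of the kind of simple filtering query readers would run.
spark.sql("SELECT * FROM events_v WHERE event_type = 'click'").show()
```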
Labels: Avro, Cluster, ETL, KB, Parquet, Persistent View, SQL, Transformation
Accepted Solutions
08-07-2022 01:25 PM
If the data is 100 MB, then I'd try a single-node cluster, which will be the smallest and least expensive option. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.
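For reference, a single-node jobs cluster can be declared in a Jobs API payload roughly as below; the runtime version and node type are placeholder assumptions, so substitute whatever is available in your workspace.

```python
# Sketch of a "new_cluster" spec for a single-node jobs cluster.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",  # assumption: pick your runtime
    "node_type_id": "Standard_DS3_v2",    # assumption: pick your node type
    "num_workers": 0,                     # no workers: driver-only cluster
    "spark_conf": {
        # These settings mark the cluster as single node.
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```

Attaching this spec to the job definition lets the scheduled run use an ephemeral jobs cluster rather than an always-on all-purpose cluster.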
08-03-2022 11:34 PM
Thanks for your question!
Please check https://docs.databricks.com/clusters/cluster-config-best-practices.html and let me know if it helps.
08-04-2022 06:40 AM
Thanks? But please pay attention to what I wrote in my question above: "I have already researched the Databricks manuals and guides and am looking for an opinion/recommendation from the community."
Simply pointing to a manual is not helpful. @Debayan Mukherjee