Hello everyone,
I am trying to determine the appropriate cluster specifications/sizing for my workload:
Run a PySpark task to transform a batch of input avro files to parquet files and create or re-create persistent views on these parquet files. This task runs every 5 mins and needs to complete within a minute.
The size of the batch of input avro files ranges from 100 KB to 100 MB per run.
It is important that the cluster supports creating and querying persistent views. I do not know yet how many processes will query the views, but I estimate about 1-10 concurrent queries running simple SELECT statements that filter data.
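Roughly, the per-run job I have in mind looks like the sketch below. The paths, the view name, and the `run_batch` helper are placeholders I made up for illustration; the real job would run on an active Databricks `SparkSession`.

```python
# Sketch of the 5-minute batch job. All paths and names below are
# placeholders, not real values from my environment.

AVRO_IN = "/mnt/landing/events/"       # assumed landing path for avro input
PARQUET_OUT = "/mnt/curated/events/"   # assumed target path for parquet output
VIEW_NAME = "events_view"              # assumed name of the persistent view

def run_batch(spark):
    """Transform one batch of avro files to parquet and refresh the view.

    `spark` is the active SparkSession (provided by Databricks).
    """
    # Read this run's batch of avro input files.
    df = spark.read.format("avro").load(AVRO_IN)

    # Append the batch to the parquet target.
    df.write.mode("append").parquet(PARQUET_OUT)

    # Re-create a persistent (metastore-backed) view over the parquet
    # files, so concurrent readers always query the latest data.
    spark.sql(
        f"CREATE OR REPLACE VIEW {VIEW_NAME} AS "
        f"SELECT * FROM parquet.`{PARQUET_OUT}`"
    )
```

The whole `run_batch` call is what would need to finish within a minute every 5 minutes.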
I have already researched the Databricks manuals and guides, and I am looking for an opinion/recommendation from the community.
Big thank you 🙂