01-07-2025 11:24 PM
Hi,
Recently, I wrote some logic that collects a DataFrame to the driver and processes it row by row. I am using a 128 GB driver node, but it is taking significantly more time than expected (about 2 hours for just 700 rows of data).
May I know which type of cluster I should use, and what driver size?
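For reference, here is a minimal sketch of the pattern I described; `my_table` and `process_row` are placeholders standing in for my actual source table and per-row logic:

```python
# Sketch of the pattern in question: collect the whole DataFrame to the
# driver, then loop over rows in Python. The loop runs single-threaded on
# the driver, so the workers sit idle regardless of driver size.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_table")  # placeholder source table

def process_row(row):
    ...  # actual per-row logic omitted

for row in df.collect():  # pulls all rows to the driver
    process_row(row)      # processed one at a time in driver-side Python
```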
Labels:
- Spark
Accepted Solutions
01-08-2025 12:35 AM
Hi @Avinash_Narala, good day!
For right-sizing the cluster, the recommended approach is hybrid node provisioning combined with autoscaling: define how many on-demand and spot instances the cluster should use, and enable autoscaling between a minimum and maximum number of instances so the cluster can scale up and down with the load. Please also refer to the documents below for more information.
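For illustration, here is a hypothetical cluster spec showing this hybrid setup. Field names follow the Databricks Clusters API; the node type, worker counts, and spot settings are placeholder assumptions to adapt, not recommendations for a specific workload:

```python
# Hypothetical cluster spec: hybrid on-demand/spot provisioning plus
# autoscaling. All values are illustrative; size them for your workload.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",  # assumed AWS node type; choose per workload
    "autoscale": {
        "min_workers": 2,  # floor the cluster scales down to when idle
        "max_workers": 8,  # ceiling it scales up to under load
    },
    "aws_attributes": {
        "first_on_demand": 2,                  # keep the first nodes on-demand for stability
        "availability": "SPOT_WITH_FALLBACK",  # remaining nodes on spot, falling back to on-demand
    },
}
```

A spec like this can be passed to the Clusters API when creating the cluster, or the same values can be set in the cluster UI.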
Please let me know if this helps, and leave a like if this information is useful; follow-ups are appreciated.
Kudos
Ayushi