
Plain Python code (for example pandas or scikit-learn) runs only on the driver. Only distributed/Spark code runs on the workers.
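A minimal sketch of that difference, assuming a `spark` session is available (as it is in a notebook). The function names are made up for illustration, and `pyspark` is imported lazily so that the plain-Python part runs without it:

```python
import statistics


def mean_on_driver(values):
    # Plain Python: this executes in the driver process only.
    # Adding workers to the cluster does not make it faster.
    return statistics.mean(values)


def mean_on_workers(spark, values):
    # Spark DataFrame operation: the driver plans the aggregation,
    # and the tasks that compute it run on the workers.
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    df = spark.createDataFrame([(float(v),) for v in values], ["x"])
    return df.agg(F.avg("x")).first()[0]
```

Calling `mean_on_driver([1, 2, 3])` never touches the workers, while `mean_on_workers(spark, [1, 2, 3])` pushes the same calculation through Spark.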

Here are some cluster tips:

If you're doing ML, then use an ML runtime.

If you're not running distributed workloads, use a single-node cluster.

Don't use autoscaling for ML.

For deep learning, use GPUs.

Try to size the cluster to match the size of your data.
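As a rough illustration, the tips above could be turned into a cluster spec like this. It is only a sketch: the helper `build_cluster_spec` is hypothetical, the field names follow the Databricks Clusters API as I understand it, and the runtime version strings and node type names are placeholders that you would replace with whatever your workspace offers.

```python
def build_cluster_spec(ml=False, deep_learning=False, distributed=True, num_workers=2):
    """Build a cluster spec dict that follows the tips above (placeholder values)."""
    # Runtime strings and node types below are placeholders, not recommendations.
    if deep_learning:
        spark_version = "13.3.x-gpu-ml-scala2.12"  # GPU ML runtime
        node_type_id = "YOUR_GPU_NODE_TYPE"
    elif ml:
        spark_version = "13.3.x-cpu-ml-scala2.12"  # CPU ML runtime
        node_type_id = "YOUR_CPU_NODE_TYPE"
    else:
        spark_version = "13.3.x-scala2.12"  # standard runtime
        node_type_id = "YOUR_CPU_NODE_TYPE"

    spec = {
        "cluster_name": "example-cluster",
        "spark_version": spark_version,
        "node_type_id": node_type_id,
    }

    if not distributed:
        # Single node: driver only, Spark runs in local mode.
        spec["num_workers"] = 0
        spec["spark_conf"] = {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        }
        spec["custom_tags"] = {"ResourceClass": "SingleNode"}
    elif ml or deep_learning:
        # Fixed worker count: no autoscaling for ML workloads.
        spec["num_workers"] = num_workers
    else:
        spec["autoscale"] = {"min_workers": 1, "max_workers": num_workers}

    return spec
```

For example, `build_cluster_spec(deep_learning=True, distributed=False)` gives a single-node GPU spec with no `autoscale` entry. Pick `num_workers` and the node type based on how much data you need to process.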
