With Spark, there are a few ways you can scale your model:
- Training
- Hyperparameter tuning
- Inference
If you're looking to train one model across multiple workers, you can leverage Horovod. It's an open source project designed to simplify distributed deep learning, and it supports Keras, TensorFlow, PyTorch, and MXNet. See the docs for HorovodRunner, and the sketch below for what a run looks like.
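Here's a minimal sketch of that pattern, assuming a Databricks runtime where `sparkdl.HorovodRunner` is available; the `train_hvd` body is a placeholder for your real model and training loop:

```python
from sparkdl import HorovodRunner

def train_hvd():
    import torch
    import horovod.torch as hvd

    hvd.init()                               # initialize Horovod on each worker
    model = torch.nn.Linear(10, 1)           # stand-in for your real network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    # Wrap the optimizer so gradients are averaged across all workers
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    # Make sure every worker starts from the same initial weights
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    # ... your training loop goes here ...

# np = number of parallel worker processes to train on
hr = HorovodRunner(np=2)
hr.run(train_hvd)
```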
If you're looking to train many candidate models in parallel, you can use Hyperopt with SparkTrials (see the sketch below). Check out this fantastic blog on best practices and tips for setting parallelism with SparkTrials.
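A minimal sketch of Hyperopt + SparkTrials; the scikit-learn model and search space here are just illustrative placeholders:

```python
from hyperopt import fmin, tpe, hp, SparkTrials

def objective(params):
    # Train a model with these hyperparameters and return the loss to minimize
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
    )
    # Hyperopt minimizes, so return the negative mean CV accuracy
    return -cross_val_score(clf, X, y, cv=3).mean()

search_space = {
    "n_estimators": hp.quniform("n_estimators", 20, 200, 10),
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
}

# parallelism = how many trials run concurrently across the cluster
spark_trials = SparkTrials(parallelism=4)
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=32, trials=spark_trials)
```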
You can always create a Spark UDF (super easy if you use MLflow, e.g. mlflow.pyfunc.spark_udf) to run inference in parallel for batch or streaming use cases.
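For example, a short sketch of scoring a logged MLflow model with a Spark UDF; the run ID and `input_df` DataFrame are placeholders:

```python
import mlflow.pyfunc
from pyspark.sql.functions import struct

# Load a logged model as a Spark UDF ("runs:/<run_id>/model" is a placeholder URI)
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="runs:/<run_id>/model")

# Batch scoring: apply the UDF to the feature columns of a Spark DataFrame
scored_df = input_df.withColumn(
    "prediction", predict_udf(struct(*input_df.columns)))

# The same UDF also works on a streaming DataFrame for streaming inference
# stream_scored = stream_df.withColumn(
#     "prediction", predict_udf(struct(*stream_df.columns)))
```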