Re: Scaling issue for inference with a spark.mllib...

Hubert-Dudek · 03-17-2022

It is hard to analyze without the Spark UI and more detailed information, but here are a few tips:

Look for data skew: some partitions can be very big and others very small because of incorrect partitioning. You can check this in the Spark UI, but also by debugging your code a bit (for example with getNumPartitions()).
Increase the number of shuffle partitions: spark.sql.shuffle.partitions defaults to 200, so try a bigger value. I would go to at least 1000.
Increase the size of the driver to be 2 times bigger than an executor (but to find the optimal size, please analyze the load: in Databricks, on the cluster tab, look at Metrics, where there is Ganglia, or even better, integrate Datadog with the cluster).
Make sure that everything runs in a distributed way, especially UDFs: use vectorized pandas UDFs so the work is done in batches on the executors.
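
A rough PySpark sketch of the skew check, the shuffle setting and a vectorized pandas UDF from the tips above. It assumes Spark 3.x with pandas and pyarrow installed. The names `partition_sizes`, `skew_ratio`, `tune_and_score` and the column name `x` are made up for illustration, and the `values * 2.0` line is only a placeholder for your real model inference, so treat it as a starting point rather than a finished solution:

```python
from statistics import median


def partition_sizes(df):
    """Return the number of rows in each partition of a Spark DataFrame."""
    # Rows are counted on the executors; only one integer per partition
    # is sent back to the driver.
    return df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()


def skew_ratio(sizes):
    """Largest partition divided by the median partition (1.0 = evenly spread)."""
    if not sizes:
        return 0.0
    mid = median(sizes)
    return float("inf") if mid == 0 else max(sizes) / mid


def tune_and_score(spark, df, feature_col="x", shuffle_partitions=1000):
    """Hypothetical helper that applies the tips to an existing DataFrame."""
    # Imported here so the helpers above can be used without Spark installed.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # More shuffle partitions than the default of 200.
    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))

    # A vectorized pandas UDF gets a whole batch as a pandas Series
    # instead of being called once per row.
    @pandas_udf("double")
    def score(values: pd.Series) -> pd.Series:
        return values * 2.0  # placeholder for real model inference

    # Report partition count and skew before scoring.
    print("partitions:", df.rdd.getNumPartitions(),
          "skew ratio:", skew_ratio(partition_sizes(df)))
    return df.withColumn("score", score(df[feature_col]))
```

In a notebook you could call `tune_and_score(spark, df)` and check the printed skew ratio. If it is far above 1, a `df.repartition(...)` on a better key before inference is probably worth trying.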

My blog: https://databrickster.medium.com/