Scaling issue for inference with a spark.mllib model

admo
New Contributor III

Hello,

I'm writing this because I have tried a lot of different directions to get a simple model inference working with no success.

Here is the outline of the job

# 1 - Load the base data (~1 billion lines of ~6 columns)
interaction = build_initial_df()
 
# 2 - Preprocessing of some of the data (~ 200 to 2000 columns) 
feature_df = add_metadata(interaction, feature_transformer_model)
 
# 3 - cross feature building and logisitic regression inference
model = PipelineModel.load(model_save_path)
prediction_per_user = model.transform(feature_df)
final_df = prediction_per_user.select(...)
 
# 4 - write the final result
write_df_to_snowflake(final_df)

This work reasonnably well when the size of the input data is around 100 smaller.

But fails on the full scale.

The size of the cluster is reasonable : up to 40 r5d.2xlarge giving a bit more than 2Tb in RAM

Execution problem :

Very intense ressource usage and a lot of disk spill

Question :

My understanding ia model inference is is similar to a map operation thus very fractionable and can work in parallel given a reasonable amount of compute.

How could I make it work given my resource budget ? What am I missing ?

Already tried :

1) I have already tried using a MLflow model udf but it doesn't work on list feature outputed by a previous feature pipeline model

2) I disabled some of the optimization of spark because it would run a stage that would fill a single executor disk or time out

3) Forcing some repartition of the dataframe to process the dataset in smaller chunks (it may be improved)

Thanks in advance