Scaling issue for inference with a spark.mllib mod...

admo · ‎03-17-2022

Hello,

I'm writing this because I have tried a lot of different directions to get a simple model inference working with no success.

Here is the outline of the job

# 1 - Load the base data (~1 billion lines of ~6 columns)
interaction = build_initial_df()
 
# 2 - Preprocessing of some of the data (~ 200 to 2000 columns) 
feature_df = add_metadata(interaction, feature_transformer_model)
 
# 3 - cross feature building and logisitic regression inference
model = PipelineModel.load(model_save_path)
prediction_per_user = model.transform(feature_df)
final_df = prediction_per_user.select(...)
 
# 4 - write the final result
write_df_to_snowflake(final_df)

This work reasonnably well when the size of the input data is around 100 smaller.

But fails on the full scale.

The size of the cluster is reasonable : up to 40 r5d.2xlarge giving a bit more than 2Tb in RAM

Execution problem :

Very intense ressource usage and a lot of disk spill

Question :

My understanding ia model inference is is similar to a map operation thus very fractionable and can work in parallel given a reasonable amount of compute.

How could I make it work given my resource budget ? What am I missing ?

Already tried :

1) I have already tried using a MLflow model udf but it doesn't work on list feature outputed by a previous feature pipeline model

2) I disabled some of the optimization of spark because it would run a stage that would fill a single executor disk or time out

3) Forcing some repartition of the dataframe to process the dataset in smaller chunks (it may be improved)

Thanks in advance

Scaling issue for inference with a spark.mllib model