Hello,
I'm writing this because I have tried a lot of different directions to get a simple model inference job working, without success.
Here is the outline of the job:
from pyspark.ml import PipelineModel

# 1 - Load the base data (~1 billion rows of ~6 columns)
interaction = build_initial_df()

# 2 - Preprocess some of the data (~200 to 2000 columns)
feature_df = add_metadata(interaction, feature_transformer_model)

# 3 - Cross-feature building and logistic regression inference
model = PipelineModel.load(model_save_path)
prediction_per_user = model.transform(feature_df)
final_df = prediction_per_user.select(...)

# 4 - Write the final result
write_df_to_snowflake(final_df)
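In case it matters, write_df_to_snowflake is just a thin wrapper around the Spark Snowflake connector, along these lines (connection options and the table name are placeholders, not my real values):

def write_df_to_snowflake(df):
    sf_options = {
        "sfURL": "...",        # account URL, credentials, etc. elided
        "sfUser": "...",
        "sfDatabase": "...",
        "sfSchema": "...",
        "sfWarehouse": "...",
    }
    (df.write
       .format("net.snowflake.spark.snowflake")
       .options(**sf_options)
       .option("dbtable", "PREDICTIONS")  # placeholder table name
       .mode("overwrite")
       .save())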
This works reasonably well when the input data is about 100x smaller, but it fails at full scale.
The size of the cluster is reasonable: up to 40 r5d.2xlarge instances, giving a bit more than 2 TB of RAM in total.
Execution problem:
Very heavy resource usage and a lot of disk spill.
Question:
My understanding is that model inference is essentially a map operation: it is row-wise, highly divisible, and should run in parallel given a reasonable amount of compute.
How could I make it work within my resource budget? What am I missing?
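To make the question concrete, here is the kind of chunked fallback I would expect to work if that mental model is right (the "user_id" column, the slice count, and output_cols are placeholders, reusing the helpers from the outline above):

from pyspark.sql import functions as F

num_slices = 20      # placeholder, to be tuned
output_cols = [...]  # same columns as in the select(...) above
for i in range(num_slices):
    # Hash-bucket the rows on a key column so each pass only touches
    # roughly 1/num_slices of the dataset.
    chunk = feature_df.where(F.abs(F.hash("user_id")) % num_slices == i)
    write_df_to_snowflake(model.transform(chunk).select(*output_cols))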
Already tried:
1) Using an MLflow model UDF, but it doesn't work on the list features output by the upstream feature pipeline model (first sketch below).
2) Disabling some of Spark's optimizations, because they produced a stage that would fill a single executor's disk or time out.
3) Forcing a repartition of the DataFrame to process the dataset in smaller chunks (this could probably be improved; second sketch below).
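For (1), here is roughly what the MLflow attempt looked like (the model URI and the "features" column name are illustrative, not my real ones):

import mlflow.pyfunc

# Load the logged model as a Spark UDF (URI is a placeholder).
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/my_model/1", result_type="double")

# "features" stands for the vector/list column produced by the feature
# pipeline model; passing it to the UDF is the part that fails.
prediction_per_user = feature_df.withColumn("prediction", predict_udf("features"))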
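And for (3), the repartition attempt is simply along these lines (the partition count is a placeholder I have been tuning by hand):

# More, smaller partitions so each task only processes a bounded slice of rows.
spark.conf.set("spark.sql.shuffle.partitions", 4000)
feature_df = feature_df.repartition(4000)
final_df = model.transform(feature_df).select(...)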
Thanks in advance