Hi,
I have a dataframe with sales records over time for roughly 1000 different items, so each item effectively has its own time series. The goal is to make predictions for each of these items. Since the items behave very differently, we opted for a separate model per item, so about 1000 models are trained and logged to MLflow experiments.
Training runs in parallel by grouping the dataframe on the item column and using applyInPandas to distribute the per-item training to the workers. The problem is that when I run this on all items, the applyInPandas function is executed twice, so each model is trained twice (I can see this in the experiments: two runs are added for each item), resulting in a much longer runtime. When I run it on a single item, only one run is created.
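For reference, this is roughly the pattern I'm using (column names, the model, and the experiment path are placeholders; the real training code is more involved):

```python
# Simplified sketch of the setup; column names, the model, and the
# experiment path are placeholders, not the exact code I'm running.
import mlflow
import pandas as pd
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

result_schema = StructType([
    StructField("item_id", StringType()),
    StructField("run_id", StringType()),
    StructField("rmse", DoubleType()),
])

def train_item_model(pdf: pd.DataFrame) -> pd.DataFrame:
    """Train one model for a single item and log it as one MLflow run."""
    from sklearn.ensemble import RandomForestRegressor

    item_id = pdf["item_id"].iloc[0]
    features = pdf.drop(columns=["item_id", "date", "sales"])  # placeholder columns
    target = pdf["sales"]

    mlflow.set_experiment("/Shared/item_forecasts")  # placeholder experiment path
    with mlflow.start_run(run_name=f"item_{item_id}") as run:
        model = RandomForestRegressor(n_estimators=100)
        model.fit(features, target)
        preds = model.predict(features)
        rmse = float(((target - preds) ** 2).mean() ** 0.5)
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, "model")

    return pd.DataFrame(
        [{"item_id": item_id, "run_id": run.info.run_id, "rmse": rmse}]
    )

# df is the Spark DataFrame with the raw sales records
results = df.groupBy("item_id").applyInPandas(train_item_model, schema=result_schema)
results.write.mode("overwrite").saveAsTable("item_training_results")  # single action at the end
```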
Some information about the cluster:
DBR: 15.2 ML
Workers: Standard_DS5_v2, 56 GB memory, 16 cores; autoscaling from 2 to 4 workers
Driver: Standard_DS4_v2, 28 GB memory, 8 cores
Some information about the data:
~1000 items with ~500 records per item
~15 numeric features
Using memory profiling, I can see that around 600 MiB is passed to the worker for each item (a rough, simplified version of such a check is sketched below).
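```python
# Simplified illustration of the per-group size check. This uses pandas'
# memory_usage rather than the actual memory_profiler setup, and the helper
# name is just for illustration.
import pandas as pd

def log_group_size(pdf: pd.DataFrame, item_col: str = "item_id") -> float:
    """Print and return the in-memory size (MiB) of the pandas DataFrame
    that applyInPandas hands to a single group/task."""
    size_mib = pdf.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"item {pdf[item_col].iloc[0]}: received {size_mib:.1f} MiB")
    return float(size_mib)
```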
How can I avoid this double execution with applyInPandas, or is this expected behaviour?
Thanks in advance.