Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

applyInPandas executed twice

fh
New Contributor

Hi,

I have a dataframe containing sales records over time for roughly 1,000 different items, so each item has its own time series. The goal is to make predictions for each of these items. Since the items behave very differently, we opted for a separate model per item. So 1,000 models are trained and stored in MLflow experiments.

The models are trained in parallel by grouping the dataframe by item (groupBy) and using applyInPandas to distribute the groups to the workers. The problem is that when I run this on all items, the applyInPandas function is executed twice. So each model is trained twice (I can see this in the experiments: two runs are added for each item), resulting in a much longer runtime. When I run it on a single item, it only creates one run.

Some information about the cluster:
DBR: 15.2 ML
Worker: Standard_DS5_v2, 56GB memory, 16 cores, min 2 max 4 workers
Driver: Standard_DS4_v2, 28GB memory, 8 cores

Some information about the data:
Roughly 1,000 items with roughly 500 records per item
Roughly 15 numeric features
Using memory_profiling, I can see that around 600 MiB is passed to the worker for each item.

How can I avoid this double execution in applyInPandas, or is this expected behaviour?

Thanks in advance.

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @fh, ensure there are no redundant calls in your code and use logging within the function to track its execution. Check your cluster configuration and data partitioning to avoid misconfigurations and uneven distribution, respectively. Verify that your Databricks Runtime (DBR 15.2 ML) isn't causing the problem, and experiment with smaller data batches. 
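One way to act on the logging suggestion is sketched below. It is a minimal illustration using only the standard library; `traced`, `train_one_item`, and the record layout are hypothetical names, not taken from the original post. Each invocation logs the item key, the process ID, and a short unique call ID, so two log lines for the same item would show whether the function really ran twice and whether the runs happened in different processes.

```python
import functools
import logging
import os
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("per_item_training")

# Counts invocations per item key. This is per-process state: on a cluster each
# Python worker has its own copy, so rely on the log lines there, not the counter.
call_counts = Counter()


def traced(key_of):
    """Decorator that logs every invocation of a per-group function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(group):
            key = key_of(group)
            call_counts[key] += 1
            logger.info(
                "call_id=%s pid=%s item=%s invocation=%d",
                uuid.uuid4().hex[:8], os.getpid(), key, call_counts[key],
            )
            return fn(group)
        return wrapper
    return decorator


@traced(key_of=lambda rows: rows[0]["item"])
def train_one_item(rows):
    # Hypothetical stand-in for model training: returns the mean of sales.
    return sum(r["sales"] for r in rows) / len(rows)


# Local demonstration with plain Python data instead of a Spark dataframe.
groups = {
    "A": [{"item": "A", "sales": 10.0}, {"item": "A", "sales": 14.0}],
    "B": [{"item": "B", "sales": 3.0}],
}
results = {item: train_one_item(rows) for item, rows in groups.items()}
duplicates = [item for item, n in call_counts.items() if n > 1]
```

With applyInPandas, the wrapped function would receive one pandas DataFrame per group, so `key_of` would need to read the item key from that frame instead; that adaptation is an assumption and is not shown here.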

I hope this helps! Let me know if you have any other questions or need further assistance.

KumaranT
New Contributor III

Hi @fh ,

To avoid this double execution, you can try using the concurrent.futures module in Python to parallelize the training of your models. This module provides a high-level interface for asynchronously executing callables.
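A minimal sketch of that suggestion follows, assuming the per-item data fits on a single machine; `train_one_item` and `train_all` are hypothetical placeholders for the real training code. Note that this runs on one node only (on Databricks, the driver), so unlike applyInPandas it does not spread the work across the cluster's workers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def train_one_item(item, rows):
    # Hypothetical placeholder for fitting and logging one model.
    mean_sales = sum(rows) / len(rows)
    return item, mean_sales


def train_all(data_by_item, max_workers=4):
    """Train one model per item; each item is submitted exactly once."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(train_one_item, item, rows)
            for item, rows in data_by_item.items()
        ]
        for future in as_completed(futures):
            item, model = future.result()  # re-raises any training error
            results[item] = model
    return results


data_by_item = {"A": [10.0, 14.0], "B": [3.0], "C": [1.0, 2.0, 3.0]}
models = train_all(data_by_item)
```

Threads are used here for simplicity. For CPU-bound training that holds the Python GIL, `ProcessPoolExecutor` from the same module may be the better fit.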
