Hi,
I have a dataframe with sales records over time for roughly 1000 different items, so each item effectively has its own time series. The goal is to make predictions for each of these items. Since the items behave very differently, we opted for a separate model per item, so about 1000 models are trained and logged to MLflow experiments.
Training runs in parallel by grouping the dataframe on the item column and using applyInPandas to distribute the per-item training to the workers. The problem is that when I run this on all items, the applyInPandas function is executed twice, so each model is trained twice (I can see this in the experiments: two runs are added for each item), resulting in a much longer runtime. When I run it on a single item, only one run is created.
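For reference, this is roughly the pattern I'm using (column names, the model, and the experiment path are placeholders; the real training code is more involved):

```python
# Simplified sketch of the setup; column names, the model, and the
# experiment path are placeholders, not the exact code I'm running.
import mlflow
import pandas as pd
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

result_schema = StructType([
    StructField("item_id", StringType()),
    StructField("run_id", StringType()),
    StructField("rmse", DoubleType()),
])

def train_item_model(pdf: pd.DataFrame) -> pd.DataFrame:
    """Train one model for a single item and log it as one MLflow run."""
    from sklearn.ensemble import RandomForestRegressor

    item_id = pdf["item_id"].iloc[0]
    features = pdf.drop(columns=["item_id", "date", "sales"])  # placeholder columns
    target = pdf["sales"]

    mlflow.set_experiment("/Shared/item_forecasts")  # placeholder experiment path
    with mlflow.start_run(run_name=f"item_{item_id}") as run:
        model = RandomForestRegressor(n_estimators=100)
        model.fit(features, target)
        preds = model.predict(features)
        rmse = float(((target - preds) ** 2).mean() ** 0.5)
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, "model")

    return pd.DataFrame(
        [{"item_id": item_id, "run_id": run.info.run_id, "rmse": rmse}]
    )

# df is the Spark DataFrame with the raw sales records
results = df.groupBy("item_id").applyInPandas(train_item_model, schema=result_schema)
results.write.mode("overwrite").saveAsTable("item_training_results")  # single action at the end
```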
Some information about the cluster:
DBR: 15.2 ML
Workers: Standard_DS5_v2, 56 GB memory, 16 cores; autoscaling from 2 to 4 workers
Driver: Standard_DS4_v2, 28 GB memory, 8 cores
Some information about the data:
~1000 items with ~500 records per item
~15 numeric features
Using memory profiling, I can see that around 600 MiB is passed to the worker for each item (a rough, simplified version of such a check is sketched below).
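```python
# Simplified illustration of the per-group size check. This uses pandas'
# memory_usage rather than the actual memory_profiler setup, and the helper
# name is just for illustration.
import pandas as pd

def log_group_size(pdf: pd.DataFrame, item_col: str = "item_id") -> float:
    """Print and return the in-memory size (MiB) of the pandas DataFrame
    that applyInPandas hands to a single group/task."""
    size_mib = pdf.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"item {pdf[item_col].iloc[0]}: received {size_mib:.1f} MiB")
    return float(size_mib)
```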
How can I avoid this double execution with applyInPandas, or is this expected behaviour?
Thanks in advance.