Hey guys, I've been running into a performance problem in my current Workflow.
Here's my use case:
- We have several Notebooks, each one responsible for calculating a specific metric (such as AOV, GMV, etc.).
- I made a pipeline that creates a dataframe with all the information I need, and at the end of the job I create a Task that transforms that dataframe into one JSON payload per metric, like this:
"{"experimentKey": "ExperimentName\", "startDate": "2024-10-22", "endDate": null, "status": "IN PROGRESS", "country": "BR", "variations": ["control_variation", "variant_a", "variant_b", "variant_c"], "lastUpdate": "2024-10-18", "Metrics": "ctr_partner", "isPrimary": true, "isGuardrail": false}",
This repeats for each one of the metrics, with only the "Metrics" field changing (a rough sketch of how I build these payloads is below).
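For context, this is roughly how I go from the final dataframe to the list of per-metric payloads; it's a sketch only, and `df`, the column layout, and the task-value key name are illustrative rather than my exact code:

import json

# Sketch: assumes `df` is the final dataframe, one row per metric, with the
# shared experiment fields repeated on every row.
rows = [row.asDict() for row in df.collect()]

# One JSON string per metric; default=str keeps date columns serializable.
payloads = [json.dumps(r, default=str) for r in rows]

# Expose the list to the Workflow so the "For each" task can iterate over it
# (the key name "metric_payloads" is just a placeholder).
dbutils.jobs.taskValues.set(key="metric_payloads", value=payloads)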
- So, using the "For each" task in Workflows, each iteration opens a notebook that runs this:
dbutils.notebook.run(
    f"/Workspace/Users/Platform/metrics_multiple_t_test/{api['Metrics']}",
    0,  # timeout_seconds = 0 means no timeout
    {
        "experiment_id": api["experimentKey"],
        "experiment_start": str(api["startDate"]),
        "isPrimary": api["isPrimary"],
        "isGuardrail": api["isGuardrail"],
        "metric": api["Metrics"],
        "environment": environment,
    },
)
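For completeness, the `api` dict above is just the "For each" payload parsed back from the string parameter the task passes in; the parameter name and widget usage here are assumptions about my setup, not a fixed Databricks API:

import json

# The "For each" task hands each JSON payload to the runner notebook as a
# string parameter (named "api" here), so it gets parsed before the run call.
api = json.loads(dbutils.widgets.get("api"))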
- It calls the specific metric notebook I need, passing the necessary information as parameters. The process used to work fine on a Serverless cluster, but now that I've moved to a dedicated cluster it takes forever. These are the cluster's specifications:
- Workers: 1-5 (8-40 cores, 64-320 GB memory)
- Driver: 1 (8 cores, 64 GB memory)
- Runtime: 14.3.x-scala2.12
- 11-33 DBU/h
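For reference, that cluster expressed as a rough Jobs API new_cluster spec looks something like the sketch below; the node type is a placeholder for whatever 8-core / 64 GB instance the cluster actually uses:

# Rough Jobs API "new_cluster" equivalent of the specs above;
# "<8-core-64gb-node-type>" is a placeholder, not the real instance type.
job_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "<8-core-64gb-node-type>",
    "driver_node_type_id": "<8-core-64gb-node-type>",
    "autoscale": {"min_workers": 1, "max_workers": 5},
}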
How can I improve this process?