I've been running some performance tests with Databricks, but I am struggling to make sense of the results.
    import time
    import numpy as np

    def g(a, b):
        # Time N evaluations of an elementwise expression and return the mean.
        N = 1000
        t1 = time.time()
        for _ in range(N):
            a**5 + 2 * b  # result is discarded; only the evaluation time matters
        t2 = time.time()
        return (t2 - t1) / N

    a = np.random.rand(2**24)
    b = np.random.rand(2**24)
Evaluating g(a, b) returns around 0.15 s per iteration on a modest "Standard_D8ds_v5" cluster with 8 cores on the driver, while it returns around 0.61 s on a more powerful "Standard_E32_v3" with 32 cores on the driver. In other words, the same calculation takes roughly four times longer on the more powerful cluster.
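To rule out wall-clock noise, I also repeated the measurement with timeit (the 2**22 array size here is just to keep the snippet quick; my actual runs used 2**24 as above):

```python
import timeit
import numpy as np

# Smaller arrays than in the original benchmark, just so this snippet runs fast.
a = np.random.rand(2**22)
b = np.random.rand(2**22)

# Time the same elementwise expression; timeit disables GC during the runs.
per_call = timeit.timeit(lambda: a**5 + 2 * b, number=50) / 50
print(f"{per_call * 1e3:.3f} ms per evaluation")
```

The relative gap between the two clusters was the same with either timing method.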
Considering that NumPy delegates to BLAS, which is supposed to release Python's GIL, I am struggling to find an explanation for what I am seeing.
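For context, this is how I checked which backend my NumPy build links against and which thread-count variables are set on each cluster (although, as far as I understand, an elementwise expression like a**5 + 2 * b goes through NumPy's ufunc machinery rather than BLAS, so I am not sure the backend even matters here):

```python
import os
import numpy as np

# Print the BLAS/LAPACK backend this NumPy build was compiled against.
np.show_config()

# Thread-count environment variables that BLAS/OpenMP backends honour, if set.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    print(var, "=", os.environ.get(var))
```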
Can somebody make sense of this and suggest a way to improve the performance on the more powerful cluster?
PS: I am aware that I could distribute the work in the loop of my function over multiple workers, using a tool like Ray clusters, but that is not what I am after here.