<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: I want to use databricks workers to run a function in parallel on the worker nodes in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12242#M7081</link>
    <description>&lt;P&gt;You guys are not getting the point: I am making API calls in a function and want to store the results in a dataframe, and I want multiple processes to run this task in parallel.&lt;/P&gt;&lt;P&gt;How do I create a UDF and use it in a dataframe when the task is calling an API repeatedly and storing the JSON payload in BLOB storage? The examples you gave me are for performing calculations etc. Please advise ASAP.&lt;/P&gt;</description>
    <pubDate>Mon, 01 Nov 2021 13:49:53 GMT</pubDate>
    <dc:creator>HamzaJosh</dc:creator>
    <dc:date>2021-11-01T13:49:53Z</dc:date>
    <item>
      <title>I want to use databricks workers to run a function in parallel on the worker nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12236#M7075</link>
      <description>&lt;P&gt;I have a function that makes API calls. I want to run this function in parallel so that the workers in my Databricks cluster are used. I have tried&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;with ThreadPoolExecutor() as executor:&lt;/P&gt;&lt;P&gt;&amp;nbsp;results = executor.map(getspeeddata, alist)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;but this does not make use of the workers and runs everything on the driver. How do I make my function run in parallel across the cluster?&lt;/P&gt;</description>
      <pubDate>Wed, 27 Oct 2021 22:27:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12236#M7075</guid>
      <dc:creator>HamzaJosh</dc:creator>
      <dc:date>2021-10-27T22:27:38Z</dc:date>
    </item>
    <item>
      <title>Re: I want to use databricks workers to run a function in parallel on the worker nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12238#M7077</link>
      <description>&lt;P&gt;Hi, please create a UDF (user-defined function) and then run it directly from a dataframe.&lt;/P&gt;&lt;P&gt;I have a dataframe with a url column; the UDF loads the responses into a new column as a structured object, which is then flattened.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Oct 2021 08:52:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12238#M7077</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-28T08:52:36Z</dc:date>
    </item>
    <item>
      <title>Re: I want to use databricks workers to run a function in parallel on the worker nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12239#M7078</link>
      <description>&lt;P&gt;You want to make sure the Spark framework is used, and not just plain Python/Scala.&lt;/P&gt;&lt;P&gt;So a UDF is the way to go.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Oct 2021 08:59:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12239#M7078</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-28T08:59:43Z</dc:date>
    </item>
    <item>
      <title>Re: I want to use databricks workers to run a function in parallel on the worker nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12240#M7079</link>
      <description>&lt;P&gt;Thanks Hubert and werners for responding.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please give me some URLs that show how I can create UDFs. Do I still need to use a thread pool? How do I make it run in parallel after using a UDF?&lt;/P&gt;&lt;P&gt;I am a newbie and need more than just "create a UDF". Please help.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Oct 2021 13:44:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12240#M7079</guid>
      <dc:creator>HamzaJosh</dc:creator>
      <dc:date>2021-10-28T13:44:51Z</dc:date>
    </item>
    <item>
      <title>Re: I want to use databricks workers to run a function in parallel on the worker nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12241#M7080</link>
      <description>&lt;P&gt;Hi @Hamza Josh​,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here are some links that might help you better understand how to create and run UDFs:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;UDFs in Python &lt;A href="https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#user-defined-functions---python" alt="https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#user-defined-functions---python" target="_blank"&gt;here&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Pandas UDFs &lt;A href="https://docs.databricks.com/spark/latest/spark-sql/pandas-function-apis.html#pandas-function-apis" alt="https://docs.databricks.com/spark/latest/spark-sql/pandas-function-apis.html#pandas-function-apis" target="_blank"&gt;here&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;More docs &lt;A href="https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html" alt="https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html" target="_blank"&gt;here&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Sat, 30 Oct 2021 00:12:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12241#M7080</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-10-30T00:12:20Z</dc:date>
    </item>
    <item>
      <title>Re: I want to use databricks workers to run a function in parallel on the worker nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12242#M7081</link>
      <description>&lt;P&gt;You guys are not getting the point: I am making API calls in a function and want to store the results in a dataframe, and I want multiple processes to run this task in parallel.&lt;/P&gt;&lt;P&gt;How do I create a UDF and use it in a dataframe when the task is calling an API repeatedly and storing the JSON payload in BLOB storage? The examples you gave me are for performing calculations etc. Please advise ASAP.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Nov 2021 13:49:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12242#M7081</guid>
      <dc:creator>HamzaJosh</dc:creator>
      <dc:date>2021-11-01T13:49:53Z</dc:date>
    </item>
    <item>
      <title>Re: I want to use databricks workers to run a function in parallel on the worker nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12243#M7082</link>
      <description>&lt;P&gt;I think we do get the point. But the thing is:&lt;/P&gt;&lt;P&gt;if you want to distribute the work to the workers, you have to use the Spark framework.&lt;/P&gt;&lt;P&gt;So a UDF is the way to go (as UDFs are part of Spark).&lt;/P&gt;&lt;P&gt;Plain Python code will only execute on the driver.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Also, Spark is lazily evaluated, meaning data is only queried/written when you apply an action.&lt;/P&gt;&lt;P&gt;That is pretty important.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So in the end you will have to create a UDF.&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/jamesshocking/Spark-REST-API-UDF-Scala" alt="https://github.com/jamesshocking/Spark-REST-API-UDF-Scala" target="_blank"&gt;https://github.com/jamesshocking/Spark-REST-API-UDF-Scala&lt;/A&gt; is an example in Scala, but the same principles apply to PySpark.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Nov 2021 08:33:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/12243#M7082</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-02T08:33:43Z</dc:date>
    </item>
    <item>
      <title>Re: I want to use databricks workers to run a function in parallel on the worker nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/153466#M53962</link>
      <description>&lt;P&gt;Hi Hubert,&lt;/P&gt;&lt;P&gt;I have the same problem. We are calling 40-50 different APIs, currently running sequentially. Now, after creating the UDF and a dataframe with the URL column, how do we pass the credentials (username and password)?&lt;BR /&gt;&lt;BR /&gt;Do we need to broadcast the credentials so they are available on every worker?&lt;/P&gt;</description>
      <pubDate>Sun, 05 Apr 2026 18:02:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-use-databricks-workers-to-run-a-function-in-parallel/m-p/153466#M53962</guid>
      <dc:creator>mordex</dc:creator>
      <dc:date>2026-04-05T18:02:46Z</dc:date>
    </item>
  </channel>
</rss>

