Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

I want to use Databricks workers to run a function in parallel on the worker nodes

HamzaJosh
New Contributor II

I have a function making API calls. I want to run this function in parallel so it uses the workers in the Databricks cluster. I have tried

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    results = executor.map(getspeeddata, alist)

to run my function, but this does not make use of the workers and runs everything on the driver. How do I make my function run in parallel?


Hubert-Dudek
Esteemed Contributor III

Hi, please make a UDF (user-defined function) and then run it directly from a DataFrame.

I have a DataFrame with the URL params, and the UDF loads the responses into a new column as a structured object, which is then flattened.

-werners-
Esteemed Contributor III

you want to make sure the Spark framework is used, and not just plain Python/Scala.

So a UDF is the way to go.

HamzaJosh
New Contributor II

Thanks Hubert and werners for responding.

Please give me some URLs that show how I can create UDFs. Do I still need to use a thread pool? How do I make it run in parallel after using a UDF?

I am a newbie and need more than just "create a UDF". Please help.

Hi @Hamza Josh,

Here are some links that might help you understand how to create and run UDFs.

HamzaJosh
New Contributor II

You guys are not getting the point, I am making API calls in a function and want to store the results in a dataframe. I want multiple processes to run this task in parallel.

How do I create a UDF and use it in a dataframe when the task is calling an API repeatedly and storing the JSON payload in BLOB storage? The examples you gave me are for making calculations etc. Please advise ASAP.

-werners-
Esteemed Contributor III

I think we do get the point. But the thing is:

if you want to distribute the work to the workers, you have to use the spark framework.

So a UDF is the way to go (as UDFs are part of Spark).

Plain python code will only execute on the driver.

Also, Spark is lazy evaluated, meaning data is only queried/written when you apply an action.

That is pretty important.

So in the end you will have to create a UDF.

https://github.com/jamesshocking/Spark-REST-API-UDF-Scala is an example in Scala, but the same principles apply to pyspark.
