10-27-2021 03:27 PM
I have a function that makes API calls. I want to run it in parallel across the workers of a Databricks cluster. I have tried

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    results = executor.map(getspeeddata, alist)

to run my function, but this does not use the workers and runs everything on the driver. How do I make my function run in parallel on the workers?
10-27-2021 11:37 PM
Hi @HamzaJosh! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise I will get back to you soon. Thanks.
10-28-2021 01:52 AM
Hi, please create a UDF (user-defined function) and then run it directly from a dataframe.
I have a dataframe with a column of URL parameters; a UDF loads each response into a new column as a structured object, which is then flattened.
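A minimal sketch of that approach in PySpark, in case it helps. The names fetch_json and add_responses, the url column name, and the status/body fields are my own assumptions for illustration, not anything from the original poster's code. The HTTP call uses only the standard library; swap in whatever client you actually use.

```python
import urllib.request


def fetch_json(url, timeout=10):
    """Call one URL and return (status, body). Never raises, so one bad row cannot fail the job."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return (resp.status, resp.read().decode("utf-8"))
    except Exception as exc:  # bad URLs, network errors, HTTP error codes
        return (None, str(exc))


def add_responses(df, url_col="url"):
    """Add flattened `status` and `body` columns to a Spark DataFrame that has a URL column."""
    # Imported here so fetch_json can be tried out without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    # The UDF returns a struct, i.e. the "structured object" column.
    schema = StructType([
        StructField("status", IntegerType(), True),
        StructField("body", StringType(), True),
    ])
    fetch_udf = F.udf(fetch_json, schema)

    return (
        df.withColumn("response", fetch_udf(F.col(url_col)))
          .select("*", "response.status", "response.body")  # flatten the struct
          .drop("response")
    )
```

Because the UDF is applied per row, Spark runs it inside the tasks on the workers, so no ThreadPoolExecutor is involved here.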
10-28-2021 01:59 AM
You want to make sure the Spark framework is used, and not just plain Python/Scala.
So a UDF is the way to go.
10-28-2021 06:44 AM
Thanks Hubert and werners for responding.
Please give me some URLs that show how I can create UDFs. Do I still need to use a thread pool? How do I make it run in parallel after using a UDF?
I am a newbie and need more than just "create a UDF". Please help.
10-29-2021 05:12 PM
11-01-2021 06:49 AM
You guys are not getting the point. I am making API calls in a function and want to store the results in a dataframe. I want multiple processes to run this task in parallel.
How do I create a UDF and use it in a dataframe when the task is calling an API repeatedly and storing the JSON payload in BLOB storage? The examples you gave me are for making calculations etc. Please advise ASAP.
11-02-2021 01:33 AM
I think we do get the point. But the thing is:
if you want to distribute the work to the workers, you have to use the Spark framework.
So a UDF is the way to go (as UDFs are part of Spark).
Plain Python code will only execute on the driver.
Also, Spark is lazily evaluated, meaning data is only queried/written when you apply an action.
That is pretty important.
So in the end you will have to create a UDF.
https://github.com/jamesshocking/Spark-REST-API-UDF-Scala is an example in Scala, but the same principles apply to PySpark.
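To make the points above concrete, here is a rough PySpark sketch of how they might fit together. Treat it as an illustration under assumptions: speed_data_for is a hypothetical stand-in for getspeeddata (I don't know what the real function returns), and run_distributed, output_path and num_partitions are names I made up.

```python
import json


def speed_data_for(item):
    """Hypothetical stand-in for getspeeddata: one API call per input item."""
    # Replace this body with the real API call; it should return the JSON payload as a string.
    return json.dumps({"item": str(item)})


def run_distributed(spark, alist, output_path, num_partitions=8):
    """Run one call per element of alist on the workers and write the payloads out."""
    # Imported here so speed_data_for can be tried out without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    call_api = F.udf(speed_data_for, StringType())

    # One row per input; repartition so the rows are spread over several tasks.
    rows = [(str(a),) for a in alist]
    df = spark.createDataFrame(rows, ["item"]).repartition(num_partitions)

    # Still lazy: this only records the plan, no API call has been made yet.
    with_payload = df.withColumn("payload", call_api(F.col("item")))

    # The write is the action that actually runs the UDF on the executors.
    with_payload.write.mode("overwrite").json(output_path)
```

In a Databricks notebook you would call something like run_distributed(spark, alist, "<your storage path>"). One thing to watch: if the resulting dataframe is used in more than one action, Spark may recompute it and repeat the API calls, so writing it out once (or caching it) is usually the safer pattern.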