I want to use Databricks workers to run a function in parallel on the worker nodes
10-27-2021 03:27 PM
I have a function that makes API calls, and I want to run it in parallel on the worker nodes of a Databricks cluster. I have tried
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    results = executor.map(getspeeddata, alist)  # threads all run on the driver
to run my function, but this does not make use of the workers and runs everything on the driver. How do I make my function run in parallel?
- Labels: Api Calls, Worker Nodes
10-28-2021 01:52 AM
Hi, please create a UDF (user-defined function) and then run it directly from a DataFrame.
I have a DataFrame with the URL parameters; a UDF loads the responses into a new column as a structured object, which is then flattened.
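A minimal sketch of that approach in PySpark, in case it helps. The names `fetch_json` and `add_responses`, the `url` column, and the three-field response schema are assumptions made for illustration; they are not from the post above. The UDF calls one URL per row, returns a struct, and the struct is then flattened into ordinary columns.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Call one URL and return a dict; errors are captured instead of raised."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
            return {"status": resp.status, "body": body, "error": None}
    except Exception as exc:
        # A failed call should not kill the whole Spark task.
        return {"status": None, "body": None, "error": str(exc)}


def add_responses(df, url_col="url"):
    """Add the API response for each row of df, flattened into columns."""
    # Imported here so fetch_json stays usable without Spark installed.
    from pyspark.sql import functions as F, types as T

    # Assumed response shape; adjust the fields to the real payload.
    schema = T.StructType([
        T.StructField("status", T.IntegerType()),
        T.StructField("body", T.StringType()),
        T.StructField("error", T.StringType()),
    ])
    fetch_udf = F.udf(fetch_json, schema)
    return (df.withColumn("response", fetch_udf(F.col(url_col)))
              .select(url_col, "response.*"))  # flatten the struct
```

In a notebook this could be used as `add_responses(spark.createDataFrame([(u,) for u in alist], ["url"]))`, where `alist` is the list of inputs from the question. No ThreadPoolExecutor is needed: Spark runs the UDF on the workers, one task per partition.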
10-28-2021 01:59 AM
You want to make sure the Spark framework is used, and not just plain Python/Scala.
So a UDF is the way to go.
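To make the "Spark, not plain Python" distinction concrete, here is a rough sketch. It uses the RDD API rather than a UDF, only because that is the shortest way to show work being handed to the workers; it assumes a cluster where `spark.sparkContext` is available. The body of `getspeeddata` is a hypothetical stand-in for the function in the question, and `run_partition`, `run_on_workers`, and `num_partitions` are names invented for this example.

```python
def getspeeddata(item):
    # Hypothetical stand-in for the API-calling function from the question.
    return {"input": item, "result": f"payload for {item}"}


def run_partition(items):
    """Runs on a worker and handles every item in one partition."""
    for item in items:
        yield getspeeddata(item)


def run_on_workers(spark, alist, num_partitions=8):
    """Split alist into partitions and process them as parallel Spark tasks."""
    rdd = spark.sparkContext.parallelize(alist, num_partitions)
    # collect() is the action that triggers execution and brings the
    # results back to the driver as a Python list.
    return rdd.mapPartitions(run_partition).collect()
```

The number of partitions sets the upper bound on how many calls run at the same time, so it is the knob to turn if the API has rate limits.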
10-28-2021 06:44 AM
Thanks Hubert and werners for responding.
Please give me some URLs that show how I can create UDFs. Do I still need to use a thread pool? How do I make it run in parallel after using a UDF?
I am a newbie and need more than just "create a UDF". Please help.
10-29-2021 05:12 PM
11-01-2021 06:49 AM
You guys are not getting the point. I am making API calls in a function and want to store the results in a DataFrame. I want multiple processes to run this task in parallel.
How do I create a UDF and use it in a DataFrame when the task is calling an API repeatedly and storing the JSON payload in Blob storage? The examples you gave me are for making calculations etc. Please advise ASAP.
11-02-2021 01:33 AM
I think we do get the point. But the thing is:
If you want to distribute the work to the workers, you have to use the Spark framework.
So a UDF is the way to go (as UDFs are part of Spark).
Plain Python code will only execute on the driver.
Also, Spark is lazily evaluated, meaning data is only queried/written when you apply an action.
That is pretty important.
So in the end you will have to create a UDF.
https://github.com/jamesshocking/Spark-REST-API-UDF-Scala is an example in Scala, but the same principles apply to PySpark.
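As a rough PySpark illustration of the points above (UDF plus lazy evaluation), the sketch below wraps an API-calling function in a UDF and writes the JSON payloads out. The helper names `as_json_string` and `fetch_and_write`, the `url` and `payload` column names, and `num_partitions` are assumptions for this example, and `out_path` stands for any storage location the cluster can write to. It is not taken from the linked repository.

```python
import json


def as_json_string(fn):
    """Wrap fn so it returns a JSON string and never raises inside a task."""
    def wrapped(arg):
        try:
            return json.dumps(fn(arg))
        except Exception as exc:
            return json.dumps({"error": str(exc)})
    return wrapped


def fetch_and_write(df, fn, out_path, input_col="url", num_partitions=8):
    """Call fn once per row on the workers and write the payloads as JSON."""
    # Imported here so as_json_string stays usable without Spark installed.
    from pyspark.sql import functions as F, types as T

    fetch_udf = F.udf(as_json_string(fn), T.StringType())
    planned = (df.repartition(num_partitions)  # spread rows over tasks
                 .withColumn("payload", fetch_udf(F.col(input_col))))
    # Up to here Spark has only built a plan; no API call has been made.
    planned.write.mode("overwrite").json(out_path)  # the action runs the UDF
    return out_path
```

Because the write is the action, the API calls happen on the workers while the output is being written. Calling a second action on the same uncached DataFrame would repeat the API calls, which matters if the API is rate-limited or billed per call.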

