Databricks

HamzaJosh · ‎10-27-2021

I have a function that makes API calls. I want to run it in parallel across the workers of a Databricks cluster. I have tried

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor() as executor:
        results = executor.map(getspeeddata, alist)

to run my function but this does not make use of the workers and runs everything on the driver. How do I make my function run in parallel?

Kaniz · ‎10-27-2021

Hi @HamzaJosh! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise I will get back to you soon. Thanks.

Hubert-Dudek · ‎10-28-2021

Hi, please create a UDF (user-defined function) and then run it directly from a DataFrame.

I have a DataFrame with the URL parameters, and a UDF loads the responses into a new column as a structured object, which is then flattened.
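A minimal sketch of what that approach could look like in PySpark, assuming a Databricks notebook where a `spark` session already exists. The helper names (`fetch_json`, `load_responses`), the column names, and the response schema (`speed`, `unit`) are hypothetical and not from this thread; the standard library's `urllib` is used only to keep the sketch free of extra packages.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Call one endpoint and return the response body as a JSON string.

    Errors are returned as a JSON object instead of being raised, so a
    single bad URL does not fail the whole Spark job.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except Exception as exc:  # deliberately broad for this sketch
        return json.dumps({"error": str(exc)})


def load_responses(spark, urls, num_partitions=8):
    """Run fetch_json on the workers and flatten the parsed result."""
    # Imported here so fetch_json can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    # Hypothetical response schema; replace with the real payload's fields.
    schema = StructType([
        StructField("speed", DoubleType()),
        StructField("unit", StringType()),
        StructField("error", StringType()),
    ])
    fetch_udf = F.udf(fetch_json, StringType())

    df = spark.createDataFrame([(u,) for u in urls], ["url"])
    return (
        df.repartition(num_partitions)                       # spread rows over tasks
          .withColumn("raw", fetch_udf(F.col("url")))        # API call runs on executors
          .withColumn("parsed", F.from_json("raw", schema))  # structured object
          .select("url", "parsed.*")                         # flatten the struct
    )
```

With this shape no thread pool is needed: Spark schedules one task per partition, so the number of partitions (together with the cores available on the cluster) bounds how many API calls run at the same time.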

-werners- · ‎10-28-2021

You want to make sure the Spark framework is used, and not just plain Python/Scala.

So a UDF is the way to go.

HamzaJosh · ‎10-28-2021

Thanks Hubert and werners for responding.

Please give me some URLs that show how I can create UDFs. Do I still need to use a thread pool? How do I make it run in parallel after using a UDF?

I am a newbie and need more than just "create a UDF". Please help.

jose_gonzalez · ‎10-29-2021

Hi @HamzaJosh,

Here are some links that might help you better understand how to create and run UDFs:

UDFs in python here
Pandas UDFs here
More docs here

HamzaJosh · ‎11-01-2021

You guys are not getting the point, I am making API calls in a function and want to store the results in a dataframe. I want multiple processes to run this task in parallel.

How do I create a UDF and use it in a dataframe when the task is calling an API repeatedly and storing the JSON payload in BLOB storage? The examples you gave me are for making calculations etc. Please advise ASAP.

-werners- · ‎11-02-2021

I think we do get the point. But the thing is:

If you want to distribute the work to the workers, you have to use the Spark framework.

So a UDF is the way to go (as UDFs are part of Spark).

Plain Python code will only execute on the driver.

Also, Spark is lazily evaluated, meaning data is only queried or written when you apply an action.

That is pretty important.
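To illustrate the point about lazy evaluation, here is a small sketch, assuming a Databricks notebook with an existing `spark` session. The body of `getspeeddata` is a stand-in for the real API call, and `run_lazily` and `output_path` are hypothetical names chosen for this example.

```python
import time


def getspeeddata(item):
    """Stand-in for the API-calling function from the question (hypothetical body)."""
    time.sleep(0.01)  # pretend network latency
    return f"payload-for-{item}"


def run_lazily(spark, alist, output_path):
    """Build the plan first, then trigger it with a single write action."""
    # Imported here so getspeeddata can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    speed_udf = F.udf(getspeeddata, StringType())
    df = spark.createDataFrame([(str(x),) for x in alist], ["item"])

    # Transformation only: this extends the query plan, no API call happens yet.
    with_payload = df.withColumn("payload", speed_udf("item"))

    # Action: writing forces the UDF to run for every row, on the workers.
    # output_path is a placeholder, e.g. a mounted storage location.
    with_payload.write.mode("overwrite").json(output_path)
    return with_payload
```

Because every action re-evaluates the plan, running a second action on `with_payload` (for example a `count()` after the write) could call the API again. Writing the result once and reading it back from storage is one way to avoid that.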

So in the end you will have to create a UDF.

https://github.com/jamesshocking/Spark-REST-API-UDF-Scala is an example in Scala, but the same principles apply to PySpark.


I want to use databricks workers to run a function in parallel on the worker nodes
