03-29-2023 02:42 AM
Hi All,
I am facing a performance issue with a PySpark UDF that posts data to a REST API (which uses a Cosmos DB backend to store the data).
Please find the details below:
# The Spark DataFrame (df) contains roughly 30-40k rows.
# I am using a Python UDF to post it to the REST API:
Ex. final_df = df.withColumn('status', save_data('A', 'B', 'C'))
# UDF function:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# rest_api_url and api_token are defined at module level
@udf(returnType=IntegerType())
def save_data(A, B, C):
    # Build the JSON payload for this row
    post_data = [{'A': A, 'B': B, 'C': C}]

    # Retry failed POSTs up to 5 times with exponential backoff
    retry_strategy = Retry(
        total=5,
        status_forcelist=[400, 500, 502, 503, 504],
        method_whitelist=['POST'],  # renamed to allowed_methods in newer urllib3
        backoff_factor=0.1
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    s = requests.Session()
    s.mount('https://', adapter)
    s.mount('http://', adapter)
    s.keep_alive = False  # note: requests.Session has no keep_alive attribute, so this has no effect

    try:
        response = s.post(
            url=rest_api_url,
            headers={'Authorization': 'Bearer ' + api_token,
                     'Content-Type': 'application/json'},
            data=json.dumps(post_data)
        )
        return response.status_code
    except requests.exceptions.RequestException:
        # Fall back to a plain (non-retrying) request if the session call fails
        response = requests.post(
            url=rest_api_url,
            headers={'Authorization': 'Bearer ' + api_token,
                     'Content-Type': 'application/json'},
            data=json.dumps(post_data)
        )
        return response.status_code
# Issue: The Databricks job hangs indefinitely at the REST API call (save_data()) and never succeeds.
# When checked from the API end, the service is hitting maximum resource utilization (100%).
To me it looks like the Python UDF is sending a large volume of requests at once, which overwhelms the API service at some point, and it stops responding.
What would be the best way to overcome this problem?
Should we split the DataFrame into multiple chunks and send them out one-by-one, or
convert it to a pandas DataFrame and then send it out row-by-row (which might be slow)?
Kindly suggest.
Labels:
- Performance Issue
- PySpark UDF
- Rest API
Accepted Solutions
04-02-2023 07:19 AM
@Sanjoy Sen :
It looks like the UDF is making a synchronous HTTP request to the REST API for each row of the DataFrame, which can cause performance issues when processing a large amount of data.
To improve the performance, you can consider the following approaches:
- Batch the data: Instead of sending each row to the REST API individually, batch the data and send multiple rows at once. This reduces the number of HTTP requests and improves the overall performance. You can use Spark's foreachPartition API to send the data in batches (a minimal sketch follows this list).
- Use an asynchronous HTTP client: You can use an asynchronous HTTP client like aiohttp or httpx to send the data asynchronously. This allows you to send multiple HTTP requests in parallel and improves the overall performance (see the aiohttp sketch at the end of this reply).
- Use a distributed system: If the data is too large to be processed on a single machine, you can consider using a distributed system like Apache Kafka or Apache Spark Streaming to process the data in real-time and send it to the REST API.
- Use a dedicated data processing framework: You can use a dedicated data processing framework like Apache Beam or Apache Flink to process the data and send it to the REST API. These frameworks are designed for processing large amounts of data and can handle data ingestion, processing, and delivery.
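To make the first suggestion concrete, here is a minimal sketch of batching with foreachPartition instead of posting one row per request. It reuses the rest_api_url and api_token variables from the original UDF (captured from the driver in the closure), and assumes the API accepts a JSON array with multiple records per POST; the batch size of 500 is a hypothetical value to tune against what the service can handle:

import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

BATCH_SIZE = 500  # hypothetical batch size; adjust to what the API can absorb

def post_partition(rows):
    # One session (with a retry-enabled adapter) per partition, reused for every batch
    retry_strategy = Retry(total=5, status_forcelist=[500, 502, 503, 504], backoff_factor=0.1)
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retry_strategy))
    headers = {'Authorization': 'Bearer ' + api_token,
               'Content-Type': 'application/json'}

    batch = []
    for row in rows:
        batch.append({'A': row['A'], 'B': row['B'], 'C': row['C']})
        if len(batch) >= BATCH_SIZE:
            session.post(url=rest_api_url, headers=headers, data=json.dumps(batch))
            batch = []
    if batch:  # flush the final partial batch
        session.post(url=rest_api_url, headers=headers, data=json.dumps(batch))

# Each partition posts its rows in batches instead of one request per row
df.select('A', 'B', 'C').foreachPartition(post_partition)

Reusing one Session per partition also avoids rebuilding the retry adapter for every single row, which the original UDF does.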
In general, sending data row-by-row from a UDF can be inefficient and cause performance issues, especially when dealing with large datasets. It's better to use batch processing and asynchronous HTTP requests to improve the performance.
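For the asynchronous option, here is a rough sketch with aiohttp, again applied per partition so that the requests within a partition are sent concurrently. The payload shape matches the original UDF (a JSON array with one record); the concurrency cap and the reuse of rest_api_url and api_token are assumptions, not tested values:

import asyncio
import json
import aiohttp

MAX_CONCURRENCY = 20  # hypothetical cap on in-flight requests, to avoid overwhelming the API

def post_partition_async(rows):
    payloads = [{'A': r['A'], 'B': r['B'], 'C': r['C']} for r in rows]

    async def post_all():
        semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
        headers = {'Authorization': 'Bearer ' + api_token,
                   'Content-Type': 'application/json'}
        async with aiohttp.ClientSession(headers=headers) as session:
            async def post_one(record):
                # The semaphore limits how many POSTs run at the same time
                async with semaphore:
                    async with session.post(rest_api_url, data=json.dumps([record])) as resp:
                        return resp.status
            return await asyncio.gather(*(post_one(p) for p in payloads))

    asyncio.run(post_all())

df.select('A', 'B', 'C').foreachPartition(post_partition_async)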
04-03-2023 11:05 PM
@Suteja Kanuri Thanks for your inputs. It really makes sense that, as you mentioned, sending data in bulk is better than row-by-row requests, provided the API supports bulk inserts. I will also look into asynchronous HTTP requests.
04-05-2023 10:04 PM
@Sanjoy Sen : That's wonderful! 🙂
12-18-2024 06:16 AM
Can we pass the /api/2.0/pipelines/{pipeline_id} URL to start the DLT pipeline inside an asynchronous function?
This is for executing multiple tables in parallel.
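For reference, a rough sketch of what this could look like with asyncio and aiohttp, assuming the Delta Live Tables update endpoint (POST /api/2.0/pipelines/{pipeline_id}/updates) is what starts a pipeline run, and using placeholder values for the workspace host, token, and pipeline IDs:

import asyncio
import aiohttp

host = 'https://<databricks-instance>'        # placeholder workspace URL
token = '<personal-access-token>'             # placeholder token
pipeline_ids = ['pipeline-id-1', 'pipeline-id-2']  # placeholder pipeline IDs

async def start_pipeline(session, pipeline_id):
    # POST to the pipeline's updates endpoint to kick off a new update
    url = f'{host}/api/2.0/pipelines/{pipeline_id}/updates'
    async with session.post(url, json={}) as resp:
        return pipeline_id, resp.status, await resp.json()

async def start_all():
    headers = {'Authorization': f'Bearer {token}'}
    async with aiohttp.ClientSession(headers=headers) as session:
        # Fire all start requests concurrently instead of one at a time
        return await asyncio.gather(*(start_pipeline(session, pid) for pid in pipeline_ids))

results = asyncio.run(start_all())
print(results)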
04-03-2023 11:36 PM
Hi @Sanjoy Sen
Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.
Please help us select the best solution by clicking on "Select As Best" if it does.
Your feedback will help us ensure that we are providing the best possible service to you. Thank you!
04-05-2023 10:40 PM
done.

