Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Performance issue with pyspark udf function calling rest api

sensanjoy
Contributor

Hi All,

I am facing a performance issue with a PySpark UDF that posts data to a REST API (which uses a Cosmos DB backend to store the data).

Please find the details below:

# The Spark dataframe (df) contains roughly 30-40k rows.

# I am using a Python UDF to post the data to the REST API:

  Ex. final_df = df.withColumn('status', save_data('A', 'B', 'C'))

# UDF function:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import requests
import json

@udf(returnType = IntegerType())
def save_data(A, B, C):
    # rest_api_url and api_token are defined elsewhere in the notebook
    post_data = list()
    post_data.append({
        'A': A,
        'B': B,
        'C': C,
    })

    retry_strategy = Retry(
        total = 5,
        status_forcelist = [400, 500, 502, 503, 504],
        method_whitelist = ['POST'],
        backoff_factor = 0.1
    )

    adapter = HTTPAdapter(max_retries = retry_strategy)
    s = requests.Session()
    s.mount('https://', adapter)
    s.mount('http://', adapter)
    s.keep_alive = False

    try:
        response = s.post(
            url = rest_api_url,
            headers = {'Authorization': 'Bearer ' + api_token, 'Content-Type': 'application/json'},
            data = json.dumps(post_data)
        )
        return response.status_code
    except requests.exceptions.RequestException:
        response = requests.post(
            url = rest_api_url,
            headers = {'Authorization': 'Bearer ' + api_token, 'Content-Type': 'application/json'},
            data = json.dumps(post_data)
        )
        return response.status_code

# Issue: The Databricks job hangs indefinitely at the REST API call (save_data()) and never succeeds.

# When checked from the API side, the service shows maximum resource utilization (100%).

To me it looks like the Python UDF is sending a large volume of requests at once, which overwhelms the API service at some point, after which it stops responding.

What would be the best way to overcome this problem?

Should we split the dataframe into multiple chunks and send them one by one, or

convert it to a pandas dataframe and send it row by row (which might be slow)?

Kindly suggest.

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@Sanjoy Sen​ :

It looks like the UDF function is making a synchronous HTTP request to the REST API for each row in the dataframe, which can cause performance issues when processing a large amount of data.

To improve the performance, you can consider the following approaches:

  1. Batch the data: Instead of sending each row to the REST API individually, batch the data and send multiple rows at once. This reduces the number of HTTP requests and improves overall performance. You can use Spark's foreachPartition API to send the data in batches (a minimal sketch follows this list).
  2. Use an asynchronous HTTP client: You can use an asynchronous HTTP client like aiohttp or httpx to send the data asynchronously. This lets you keep multiple HTTP requests in flight at once and improves throughput (see the second sketch at the end of this reply).
  3. Use a distributed system: If the data is too large to be processed on a single machine, you can consider using a distributed system like Apache Kafka or Apache Spark Streaming to process the data in real-time and send it to the REST API.
  4. Use a dedicated data processing framework: You can use a dedicated data processing framework like Apache Beam or Apache Flink to process the data and send it to the REST API. These frameworks are designed for processing large amounts of data and can handle data ingestion, processing, and delivery.
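A minimal sketch of option 1, assuming the API endpoint can accept a JSON array of records in a single POST; the batch size of 500 is just a placeholder to tune, and rest_api_url / api_token are the same variables used in your UDF:

import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

BATCH_SIZE = 500  # placeholder; tune to what the API can absorb

def post_partition(rows):
    # One session (with retries) per partition instead of one per row
    retry_strategy = Retry(
        total=5,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=frozenset(['POST']),  # method_whitelist on older urllib3 versions
        backoff_factor=0.5,
    )
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retry_strategy))
    headers = {'Authorization': 'Bearer ' + api_token, 'Content-Type': 'application/json'}

    def send(batch):
        # Timeout so the task fails fast instead of hanging on an unresponsive API
        session.post(rest_api_url, headers=headers, data=json.dumps(batch), timeout=30)

    batch = []
    for row in rows:
        batch.append({'A': row['A'], 'B': row['B'], 'C': row['C']})
        if len(batch) >= BATCH_SIZE:
            send(batch)
            batch = []
    if batch:
        send(batch)

# Drives the POSTs as a side effect; no 'status' column is produced this way
df.foreachPartition(post_partition)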

In general, sending data row-by-row from a UDF can be inefficient and cause performance issues, especially when dealing with large datasets. It's better to use batch processing and asynchronous HTTP requests to improve the performance.
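And a rough sketch of option 2, again run per partition. It uses aiohttp, keeps each record wrapped in a single-element list to match your current payload format, and caps concurrency at 10 in-flight requests so the API is not flooded (all of these are assumptions to adjust for your service):

import asyncio
import json
import aiohttp

async def post_rows(rows, concurrency=10):
    # Cap the number of in-flight requests so the API is not overwhelmed
    sem = asyncio.Semaphore(concurrency)
    headers = {'Authorization': 'Bearer ' + api_token, 'Content-Type': 'application/json'}

    async with aiohttp.ClientSession() as session:

        async def post_one(payload):
            async with sem:
                async with session.post(rest_api_url, headers=headers,
                                        data=json.dumps([payload]),
                                        timeout=aiohttp.ClientTimeout(total=30)) as resp:
                    return resp.status

        payloads = [{'A': r['A'], 'B': r['B'], 'C': r['C']} for r in rows]
        return await asyncio.gather(*(post_one(p) for p in payloads))

def post_partition_async(rows):
    asyncio.run(post_rows(list(rows)))

df.foreachPartition(post_partition_async)

Of the two, batching is usually the bigger win, since it cuts the number of requests rather than just overlapping them.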


6 REPLIES


sensanjoy
Contributor

@Suteja Kanuri​ Thanks for your inputs. It really makes sense, as you mentioned, that sending data in bulk is better than row-by-row requests, provided the API supports bulk insert. I will also look into asynchronous HTTP requests.

Anonymous
Not applicable

@Sanjoy Sen​ : That's wonderful! 🙂

JUMAN4422
New Contributor II

Can we pass the /api/2.0/pipelines/{pipeline_id} URL to start the DLT pipeline inside an asynchronous function? This is for executing multiple tables in parallel.

Anonymous
Not applicable

Hi @Sanjoy Sen​ 

Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.

Please help us select the best solution by clicking on "Select As Best" if it does.

Your feedback will help us ensure that we are providing the best possible service to you. Thank you!

done.
