03-29-2023 02:42 AM
Hi All,
I am facing a performance issue with a PySpark UDF that posts data to a REST API (which uses a Cosmos DB backend to store the data).
Please find the details below:
# The Spark DataFrame (df) contains roughly 30-40k rows.
# I am using a Python UDF to post it to the REST API:
Ex. final_df = df.withColumn('status', save_data('A', 'B', 'C'))
# UDF function:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# rest_api_url and api_token are defined at module level
@udf(returnType=IntegerType())
def save_data(A, B, C):
    # Build the JSON payload for this row
    post_data = [{'A': A, 'B': B, 'C': C}]

    # Retry failed POSTs up to 5 times with exponential backoff
    retry_strategy = Retry(
        total=5,
        status_forcelist=[400, 500, 502, 503, 504],
        method_whitelist=['POST'],  # renamed to allowed_methods in newer urllib3
        backoff_factor=0.1
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    s = requests.Session()
    s.mount('https://', adapter)
    s.mount('http://', adapter)
    s.keep_alive = False  # note: requests.Session has no keep_alive attribute, so this has no effect

    try:
        response = s.post(
            url=rest_api_url,
            headers={'Authorization': 'Bearer ' + api_token,
                     'Content-Type': 'application/json'},
            data=json.dumps(post_data)
        )
        return response.status_code
    except requests.exceptions.RequestException:
        # Fall back to a plain (non-retrying) request if the session call fails
        response = requests.post(
            url=rest_api_url,
            headers={'Authorization': 'Bearer ' + api_token,
                     'Content-Type': 'application/json'},
            data=json.dumps(post_data)
        )
        return response.status_code
# Issue: The Databricks job hangs indefinitely at the REST API call (save_data()) and never succeeds.
# When checked from the API end, the service is hitting maximum resource utilization (100%).
To me it looks like the Python UDF is sending a large volume of requests at once, which overwhelms the API service at some point, and it stops responding.
What would be the best way to overcome this problem?
Should we split the DataFrame into multiple chunks and send them out one-by-one, or
convert it to a pandas DataFrame and then send it out row-by-row (which might be slow)?
Kindly suggest.
Labels:
- Performance Issue
- PySpark UDF
- Rest API
Accepted Solutions
04-02-2023 07:19 AM
@Sanjoy Sen :
It looks like the UDF is making a synchronous HTTP request to the REST API for each row of the DataFrame, which can cause performance issues when processing a large amount of data.
To improve the performance, you can consider the following approaches:
- Batch the data: Instead of sending each row to the REST API individually, batch the data and send multiple rows at once. This reduces the number of HTTP requests and improves the overall performance. You can use Spark's foreachPartition API to send the data in batches (a minimal sketch follows this list).
- Use an asynchronous HTTP client: You can use an asynchronous HTTP client like aiohttp or httpx to send the data asynchronously. This allows you to send multiple HTTP requests in parallel and improves the overall performance (see the aiohttp sketch at the end of this reply).
- Use a distributed system: If the data is too large to be processed on a single machine, you can consider using a distributed system like Apache Kafka or Apache Spark Streaming to process the data in real-time and send it to the REST API.
- Use a dedicated data processing framework: You can use a dedicated data processing framework like Apache Beam or Apache Flink to process the data and send it to the REST API. These frameworks are designed for processing large amounts of data and can handle data ingestion, processing, and delivery.
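To make the first suggestion concrete, here is a minimal sketch of batching with foreachPartition instead of posting one row per request. It reuses the rest_api_url and api_token variables from the original UDF (captured from the driver in the closure), and assumes the API accepts a JSON array with multiple records per POST; the batch size of 500 is a hypothetical value to tune against what the service can handle:

import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

BATCH_SIZE = 500  # hypothetical batch size; adjust to what the API can absorb

def post_partition(rows):
    # One session (with a retry-enabled adapter) per partition, reused for every batch
    retry_strategy = Retry(total=5, status_forcelist=[500, 502, 503, 504], backoff_factor=0.1)
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retry_strategy))
    headers = {'Authorization': 'Bearer ' + api_token,
               'Content-Type': 'application/json'}

    batch = []
    for row in rows:
        batch.append({'A': row['A'], 'B': row['B'], 'C': row['C']})
        if len(batch) >= BATCH_SIZE:
            session.post(url=rest_api_url, headers=headers, data=json.dumps(batch))
            batch = []
    if batch:  # flush the final partial batch
        session.post(url=rest_api_url, headers=headers, data=json.dumps(batch))

# Each partition posts its rows in batches instead of one request per row
df.select('A', 'B', 'C').foreachPartition(post_partition)

Reusing one Session per partition also avoids rebuilding the retry adapter for every single row, which the original UDF does.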
In general, sending data row-by-row from a UDF can be inefficient and cause performance issues, especially when dealing with large datasets. It's better to use batch processing and asynchronous HTTP requests to improve the performance.
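For the asynchronous option, here is a rough sketch with aiohttp, again applied per partition so that the requests within a partition are sent concurrently. The payload shape matches the original UDF (a JSON array with one record); the concurrency cap and the reuse of rest_api_url and api_token are assumptions, not tested values:

import asyncio
import json
import aiohttp

MAX_CONCURRENCY = 20  # hypothetical cap on in-flight requests, to avoid overwhelming the API

def post_partition_async(rows):
    payloads = [{'A': r['A'], 'B': r['B'], 'C': r['C']} for r in rows]

    async def post_all():
        semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
        headers = {'Authorization': 'Bearer ' + api_token,
                   'Content-Type': 'application/json'}
        async with aiohttp.ClientSession(headers=headers) as session:
            async def post_one(record):
                # The semaphore limits how many POSTs run at the same time
                async with semaphore:
                    async with session.post(rest_api_url, data=json.dumps([record])) as resp:
                        return resp.status
            return await asyncio.gather(*(post_one(p) for p in payloads))

    asyncio.run(post_all())

df.select('A', 'B', 'C').foreachPartition(post_partition_async)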
04-03-2023 11:05 PM
@Suteja Kanuri Thanks for your inputs. It really makes sense that, as you mentioned, sending data in bulk is better than row-by-row requests, provided the API supports bulk inserts. I will also look into asynchronous HTTP requests.
04-05-2023 10:04 PM
@Sanjoy Sen : That's wonderful! 🙂
12-18-2024 06:16 AM
Can we pass the /api/2.0/pipelines/{pipeline_id} URL to start the DLT pipeline inside an asynchronous function?
This is for executing multiple tables in parallel.
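For reference, a rough sketch of what this could look like with asyncio and aiohttp, assuming the Delta Live Tables update endpoint (POST /api/2.0/pipelines/{pipeline_id}/updates) is what starts a pipeline run, and using placeholder values for the workspace host, token, and pipeline IDs:

import asyncio
import aiohttp

host = 'https://<databricks-instance>'        # placeholder workspace URL
token = '<personal-access-token>'             # placeholder token
pipeline_ids = ['pipeline-id-1', 'pipeline-id-2']  # placeholder pipeline IDs

async def start_pipeline(session, pipeline_id):
    # POST to the pipeline's updates endpoint to kick off a new update
    url = f'{host}/api/2.0/pipelines/{pipeline_id}/updates'
    async with session.post(url, json={}) as resp:
        return pipeline_id, resp.status, await resp.json()

async def start_all():
    headers = {'Authorization': f'Bearer {token}'}
    async with aiohttp.ClientSession(headers=headers) as session:
        # Fire all start requests concurrently instead of one at a time
        return await asyncio.gather(*(start_pipeline(session, pid) for pid in pipeline_ids))

results = asyncio.run(start_all())
print(results)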
04-03-2023 11:36 PM
Hi @Sanjoy Sen
Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.
Please help us select the best solution by clicking on "Select As Best" if it does.
Your feedback will help us ensure that we are providing the best possible service to you. Thank you!
04-05-2023 10:40 PM
done.

