โ03-29-2023 02:42 AM
Hi All,
I am facing some performance issue with one of pyspark udf function that post data to REST API(uses cosmos db backend to store the data).
Please find the details below:
# The spark dataframe(df) contains near about 30-40k data.
# I am using python udf function to post it over rest api:
Ex. final_df = df.withColumn('status', save_data('A', 'B', 'C'))
# udf function:
@udf(returnType = IntegerType())
def save_data(A, B, C):
post_data = list()
post_data.append({
'A': A,
'B': B,
'C': C,
})
retry_strategy = Retry(
total = 5,
status_forcelist = [400, 500, 502, 503, 504],
method_whitelist = ['POST'],
backoff_factor = 0.1
)
adapter = HTTPAdapter(max_retries = retry_strategy)
s = requests.Session()
s.mount('https://', adapter)
s.mount('http://', adapter)
s.keep_alive = False
try:
response = s.post(
url = rest_api_url,
headers = {'Authorization': 'Bearer ' + api_token,'Content-Type': "application/json"},
data = json.dumps(post_data)
)
return response.status_code
except:
response = requests.post(
url = rest_api_url,
headers = {'Authorization': 'Bearer ' + api_token,'Content-Type': "application/json"},
data = json.dumps(post_data)
)
return response.status_code
# Issue: Databricks jobs gets hanged for infinite time at rest api call(save_data()) and never succeeded.
# When checked from API end, its showing the service touching maximum resource utilization(100%).
To me it looks like the python udf is sending bulk data at a time which overwhelmed the api service at some point of time and it stopped responding.
What would be the best way we can overcome this problem?
Should we split the dataframe into multiple chunks and send it out one-by-one or
convert it to pandas df and then send out row-by-row (might be slow)
Kindly suggest.
โ04-02-2023 07:19 AM
@Sanjoy Senโ :
It looks like the UDF function is making a synchronous HTTP request to the REST API for each row in the dataframe, which can cause performance issues when processing a large amount of data.
To improve the performance, you can consider the following approaches:
In general, sending data row-by-row from a UDF can be inefficient and cause performance issues, especially when dealing with large datasets. It's better to use batch processing and asynchronous HTTP requests to improve the performance.
โ04-02-2023 07:19 AM
@Sanjoy Senโ :
It looks like the UDF function is making a synchronous HTTP request to the REST API for each row in the dataframe, which can cause performance issues when processing a large amount of data.
To improve the performance, you can consider the following approaches:
In general, sending data row-by-row from a UDF can be inefficient and cause performance issues, especially when dealing with large datasets. It's better to use batch processing and asynchronous HTTP requests to improve the performance.
โ04-03-2023 11:05 PM
@Suteja Kanuriโ Thanks for your inputs and it really make sense as you mentioned that bulk data sending is better than row-by-row request provided that API support bulk insert. I will also look into the asynchronous http request.
โ04-05-2023 10:04 PM
@Sanjoy Senโ : Thats wonderful! ๐
4 weeks ago
can we pass the /api/2.0/pipelines/{pipeline_id} url to start the dlt pipeline inside a asynchronous function,
this is for parallelly executing multiple tables.
โ04-03-2023 11:36 PM
Hi @Sanjoy Senโ
Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.
Please help us select the best solution by clicking on "Select As Best" if it does.
Your feedback will help us ensure that we are providing the best possible service to you. Thank you!
โ04-05-2023 10:40 PM
done.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโt want to miss the chance to attend and share knowledge.
If there isnโt a group near you, start one and help create a community that brings people together.
Request a New Group