Limiting parallelism when external APIs are invoked (e.g. MLflow)

Edmondo
New Contributor III

We are applying a groupBy operation to a pyspark.sql.DataFrame and then, for each group, training a single model and logging it to MLflow. We see intermittent failures because the MLflow server replies with HTTP 429, due to too many requests per second.

What are the best practices in these cases, and how do you limit the outgoing invocations of an external service? We are using managed MLflow in Databricks; is there a way to configure MLflow so that it queues requests before sending them to the server?
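For context, here is a minimal sketch of the pattern being described. The column names, schema, and scikit-learn model are illustrative, and on Databricks the executors would also need the tracking/experiment context configured:

```python
# Hypothetical sketch: one pandas UDF invocation per group, each training a
# model and logging it to MLflow from an executor task.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LinearRegression

def train_group(pdf: pd.DataFrame) -> pd.DataFrame:
    group_id = pdf["group_id"].iloc[0]
    with mlflow.start_run(run_name=f"group-{group_id}"):
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        mlflow.sklearn.log_model(model, "model")   # each call is an HTTP request
        mlflow.log_metric("n_rows", len(pdf))
    return pdf[["group_id"]].head(1)

# Every group becomes a Spark task; with many tasks logging at once, the
# tracking server starts replying with 429.
result = (
    df.groupBy("group_id")
      .applyInPandas(train_group, schema="group_id string")
)
```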

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

At least in Azure, the MLflow limits are quite strict per workspace:

  • Low throughput experiment management (list, update, delete, restore): 7 qps
  • Search runs: 7 qps
  • Log batch: 47 qps
  • All other APIs: 127 qps

qps = queries per second. In addition, there is a limit of 20 concurrent model versions in Pending status (in creation) per workspace, and 429 responses are automatically retried.

Are the models trained in parallel for every group? Maybe instead of training in parallel, train the groups one by one and monitor executor usage; it can be close to 100% anyway, and the total time can be about the same.
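A minimal sketch of that sequential alternative, assuming the list of group keys fits on the driver and `train_model` is your own (hypothetical) training function:

```python
# Sequential variant: the Spark filter/collect is still distributed, but the
# training and MLflow logging happen one group at a time on the driver, so
# the tracking server sees only one run's worth of requests at any moment.
import mlflow
import mlflow.sklearn

group_ids = [row["group_id"] for row in df.select("group_id").distinct().collect()]

for group_id in group_ids:
    pdf = df.filter(df["group_id"] == group_id).toPandas()
    with mlflow.start_run(run_name=f"group-{group_id}"):
        model = train_model(pdf)                 # hypothetical training helper
        mlflow.sklearn.log_model(model, "model")
        mlflow.log_metric("n_rows", len(pdf))
```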

11 REPLIES

Edmondo
New Contributor III

Thanks, the documented limits are the same for AWS (I had checked that). So there are three options:

  • the UDF can be applied at most 7 at a time in parallel (how do I do that? a sketch follows below)
  • MLflow calls must be queued (again, how do I add a stateful queue across all cluster nodes?)
  • or I can use some sort of locking/coordination mechanism (is there anything available, or should I set up a ZooKeeper instance?)

Thanks
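One way to get the first two options without standing up ZooKeeper is to coordinate everything from the driver: a thread pool capped at 7 workers bounds the concurrency, and a small backoff wrapper absorbs any 429 that still slips through. A rough sketch, where `train_model` is a hypothetical helper; depending on your MLflow version the fluent API may not handle concurrent runs across threads well, in which case MlflowClient is an alternative:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

import mlflow
import mlflow.sklearn

MAX_PARALLEL = 7  # cap concurrency near the strictest documented limit (7 qps)

def with_backoff(fn, max_attempts=5):
    """Retry fn with exponential backoff and jitter when a 429 comes back."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:                # MLflow surfaces HTTP errors as exceptions
            if "429" not in str(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt + random.random())

def train_one_group(group_id):
    pdf = df.filter(df["group_id"] == group_id).toPandas()
    with mlflow.start_run(run_name=f"group-{group_id}"):
        model = train_model(pdf)                # hypothetical training helper
        with_backoff(lambda: mlflow.sklearn.log_model(model, "model"))

group_ids = [r["group_id"] for r in df.select("group_id").distinct().collect()]
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    # Because all calls go through this single driver-side pool, no external
    # locking service is needed to cap the cluster-wide request rate.
    list(pool.map(train_one_group, group_ids))
```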

Anonymous
Not applicable

@Edmondo Porcu​ - My name is Piper, and I'm a moderator for Databricks. I apologize for taking so long to respond. We are looking for the best person to help you.

Kaniz
Community Manager

Hi @Edmondo Porcu, 429 is an HTTP response status code indicating that the client application has exceeded its rate limit, i.e. the number of requests it can send in a given period of time.

Please go through these similar threads:
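Besides retrying, the request volume itself can be reduced: the limits quoted earlier in the thread give the log-batch endpoint a much higher budget than the individual logging calls, so grouping params and metrics into a single log_batch request per run helps stay under the rate limit. A small sketch using the client API, where the experiment ID, metric, and param names are purely illustrative:

```python
import time

from mlflow.tracking import MlflowClient
from mlflow.entities import Metric, Param

client = MlflowClient()
run = client.create_run(experiment_id="0")     # "0" = default experiment, for illustration

ts = int(time.time() * 1000)
metrics = [Metric("rmse", 0.42, ts, 0), Metric("mae", 0.31, ts, 0)]
params = [Param("alpha", "0.1"), Param("max_depth", "6")]

# One HTTP request instead of four separate log_metric/log_param calls.
client.log_batch(run.info.run_id, metrics=metrics, params=params)
client.set_terminated(run.info.run_id)
```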

Edmondo
New Contributor III

That's why I asked how to limit parallelism, not what a 429 error means 🙂 That one I already know. We got an answer from the RA that we pay for as professional customers; it looks like this community is pretty useless if the experts do not participate 😞

Kaniz
Community Manager

Hi @Edmondo Porcu, I understand your concern. Did this thread on a somewhat similar issue not help?

Edmondo
New Contributor III

Yes. I confirm there is no sign of an answer to my question: "how to limit parallelism"

Kaniz
Community Manager

Hi @Edmondo Porcu​ , Thank you for the clarification. We'll try our best to resolve the issue as soon as possible.

Edmondo
New Contributor III

To me it's already resolved through professional services. The question I do have is how useful this community is if people with the right background aren't here and it takes a month to get a non-answer.

Kaniz
Community Manager

Hi @Edmondo Porcu​ , Would you like to provide the resolution here in order to help other community members?

Anonymous
Not applicable

@Edmondo Porcu - Thank you for your feedback and for letting us know about your concerns. I apologize for the long wait. We are working on our procedures to improve the situation.

Thanks again!
