01-17-2022 11:05 AM
We are applying a groupBy operation to a pyspark.sql.DataFrame and then training a single model per group with MLflow. We see intermittent failures because the MLflow server replies with a 429 (too many requests per second).
What are the best practices in these cases, and how do you limit the outgoing invocations of an external service? We are using managed MLflow in Databricks. Is there a way to configure MLflow so that it queues subsequent requests before sending them to the server?
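For context, here is a minimal sketch of the pattern in question: one model trained and logged per group via applyInPandas. The column names, the scikit-learn model, and the output schema are illustrative assumptions, not our actual pipeline.

```python
# Hypothetical sketch of the per-group training pattern (column names, model
# type, and schema are illustrative, not the real pipeline).
import mlflow
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()

def train_group(pdf: pd.DataFrame) -> pd.DataFrame:
    group = pdf["group"].iloc[0]
    # Each invocation opens a run and logs a model, i.e. several REST calls
    # to the MLflow tracking server per group.
    with mlflow.start_run(run_name=f"group-{group}"):
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        mlflow.log_metric("r2", model.score(pdf[["x"]], pdf["y"]))
        mlflow.sklearn.log_model(model, "model")
    return pd.DataFrame({"group": [group], "status": ["ok"]})

df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 2.0, 4.1), ("b", 1.0, 1.0), ("b", 3.0, 3.2)],
    ["group", "x", "y"],
)

# Every group becomes its own task; with many groups running concurrently,
# the tracking server sees a burst of requests and starts returning 429s.
result = df.groupBy("group").applyInPandas(train_group, schema="group string, status string")
result.show()
```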
01-17-2022 11:49 AM
At least in Azure, MLflow limits are quite strict per workspace: qps (queries per second). In addition, there is a limit of 20 concurrent model versions in Pending status (in creation) per workspace. Additionally, 429 responses are automatically retried.
Are models trained in parallel for every group? Maybe instead of training in parallel, just train one group at a time and monitor executor usage; since it can be close to 100% anyway, it can take about the same time.
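To make the "one group at a time" suggestion concrete, here is a rough sketch (reusing the illustrative `df` and `train_group` from the snippet above): drive the loop from the driver so that only one group's MLflow calls are in flight at any moment.

```python
# Sketch only: sequential per-group training so that at most one set of
# MLflow REST calls is outstanding. Reuses the illustrative `df` and
# `train_group` from the snippet above.
groups = [row["group"] for row in df.select("group").distinct().collect()]

for g in groups:
    pdf = df.filter(df["group"] == g).toPandas()  # bring one group to the driver
    train_group(pdf)                              # trains and logs this group only
```

This trades cluster parallelism for a predictable request rate; a middle ground would be to cap how many groups are processed concurrently rather than going fully sequential.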
01-17-2022 11:52 AM
Thanks, the documented limits for AWS are the same (I had checked that). So there are three options:
Thanks
02-16-2022 09:02 AM
@Edmondo Porcu - My name is Piper, and I'm a moderator for Databricks. I apologize for taking so long to respond. We are looking for the best person to help you.
02-24-2022 07:12 AM
That's why I asked how to limit parallelism, not what a 429 error means. That one I already know. We got an answer from our RA, whom we pay for as professional customers; it looks like this community is pretty useless if the experts do not participate.
02-24-2022 07:26 AM
Yes. I confirm there is no sign of an answer to my question: "how to limit parallelism"
02-24-2022 07:57 AM
To me it's already resolved through professional services. The question I do have is how useful this community is if people with the right background aren't here, and if it takes a month to get a non-answer.
02-24-2022 08:35 AM
@Edmondo Porcu - Thank you for your feedback and for letting us know about your concerns. I apologize that you had to wait so long. We are working on our procedures to alleviate the situation.
Thanks again!