Endpoint creation without scale-to-zero
11-14-2024 10:43 PM
Hi, I have a question about deploying an endpoint for Llama 3.1 8B. The following code should create the endpoint without scale-to-zero. The endpoint is created, but with scale-to-zero enabled, even though scale_to_zero_enabled is set to False. I also tried passing the value as a string instead of a boolean (both upper and lower case), but that does not change the result. What do I have to change so that scale-to-zero is actually disabled?
import mlflow.deployments

cl = mlflow.deployments.get_deploy_client("databricks")
cl.create_endpoint(
    name="llama3_1_8b_instruct",
    config={
        "served_entities": [
            {
                "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
                "entity_version": "2",
                "max_provisioned_throughput": 12000,
                "scale_to_zero_enabled": False,
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": "meta_llama_v3_1_8b_instruct-2",
                    "traffic_percentage": "100",
                }
            ]
        },
    },
)
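One way to confirm what the endpoint actually ended up with is to read its configuration back and look at the scale-to-zero and throughput fields per served entity. The following is only a minimal sketch: the helper name scale_to_zero_report is hypothetical, and the response layout (a "config" or "pending_config" key holding a "served_entities" list) is an assumption about what get_endpoint returns, not something shown in this thread.

```python
from typing import Any, Dict, List


def scale_to_zero_report(endpoint: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize scale-to-zero related fields for each served entity.

    ASSUMPTION: `endpoint` looks like the dict returned by
    client.get_endpoint(...), with a "config" key (or "pending_config"
    while an update is in progress) that holds "served_entities".
    """
    # Prefer the pending config if present, since it reflects the latest request.
    config = endpoint.get("pending_config") or endpoint.get("config") or {}
    report = []
    for entity in config.get("served_entities", []):
        report.append(
            {
                "name": entity.get("name"),
                "scale_to_zero_enabled": entity.get("scale_to_zero_enabled"),
                "min_provisioned_throughput": entity.get("min_provisioned_throughput"),
                "max_provisioned_throughput": entity.get("max_provisioned_throughput"),
            }
        )
    return report


if __name__ == "__main__":
    # Hand-written sample for illustration only, not real API output.
    sample = {
        "name": "llama3_1_8b_instruct",
        "config": {
            "served_entities": [
                {
                    "name": "meta_llama_v3_1_8b_instruct-2",
                    "scale_to_zero_enabled": True,
                    "min_provisioned_throughput": 0,
                    "max_provisioned_throughput": 12000,
                }
            ]
        },
    }
    for row in scale_to_zero_report(sample):
        print(row)
```

Against a live workspace, the same helper could be fed with cl.get_endpoint(endpoint="llama3_1_8b_instruct"), assuming the returned object behaves like a dict.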
11-15-2024 08:16 AM
Can you try the following?
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.create_endpoint(
    name="llama3_1_8b_instruct",
    config={
        "served_entities": [
            {
                "name": "llama3_1_8b_instruct-entity",
                "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
                "entity_version": "2",
                "workload_size": "Small",
                "scale_to_zero_enabled": False
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": "llama3_1_8b_instruct-entity",
                    "traffic_percentage": 100
                }
            ]
        }
    }
)
11-18-2024 01:57 AM
Thanks for the reply, @Walter_C. That didn't quite work: the endpoint was created on a CPU and ignored max_provisioned_throughput. I finally got it to work by setting both min_provisioned_throughput and max_provisioned_throughput, like this:
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.create_endpoint(
    name="llama3_1_8b_instruct-test",
    config={
        "served_entities": [
            {
                "name": "llama3_1_8b_instruct-entity",
                "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
                "entity_version": "2",
                "scale_to_zero_enabled": "false",
                "min_provisioned_throughput": 12000,
                "max_provisioned_throughput": 12000
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": "llama3_1_8b_instruct-entity",
                    "traffic_percentage": 100
                }
            ]
        }
    }
)
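Compared with the original attempt, the working version gives the served entity an explicit name, routes traffic to that same name, and sets min_provisioned_throughput alongside max_provisioned_throughput. A minimal sketch of how that pattern could be wrapped in a reusable helper follows. The function names provisioned_throughput_config and create_fixed_throughput_endpoint are hypothetical, and the sketch passes Python False rather than the string "false" used in the post, which is an assumption about what the API accepts.

```python
from typing import Any, Dict


def provisioned_throughput_config(
    entity_name: str,
    entity_version: str,
    served_name: str,
    throughput: int,
) -> Dict[str, Any]:
    """Build an endpoint config with min and max throughput set to the same value."""
    return {
        "served_entities": [
            {
                "name": served_name,
                "entity_name": entity_name,
                "entity_version": entity_version,
                # ASSUMPTION: boolean False instead of the string "false" from the post.
                "scale_to_zero_enabled": False,
                # Setting min equal to max mirrors the configuration that worked in the thread.
                "min_provisioned_throughput": throughput,
                "max_provisioned_throughput": throughput,
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    # Must match the "name" of the served entity above.
                    "served_model_name": served_name,
                    "traffic_percentage": 100,
                }
            ]
        },
    }


def create_fixed_throughput_endpoint(endpoint_name: str, config: Dict[str, Any]) -> Any:
    """Create the endpoint; needs mlflow and Databricks credentials at call time."""
    # Imported lazily so the config builder can be used and tested without mlflow.
    from mlflow.deployments import get_deploy_client

    client = get_deploy_client("databricks")
    return client.create_endpoint(name=endpoint_name, config=config)


if __name__ == "__main__":
    cfg = provisioned_throughput_config(
        entity_name="system.ai.meta_llama_v3_1_8b_instruct",
        entity_version="2",
        served_name="llama3_1_8b_instruct-entity",
        throughput=12000,
    )
    print(cfg)
```

Keeping the config builder separate from the API call makes it possible to inspect or unit-test the payload before anything is deployed.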