Endpoint creation without scale-to-zero

damselfly20
New Contributor III

Hi, I have a question about deploying an endpoint for Llama 3.1 8B. The code below is meant to create the endpoint without scale-to-zero. The endpoint is created, but scale-to-zero is still enabled even though scale_to_zero_enabled is set to False. I also tried passing the value as a string instead of a boolean (both upper and lower case), but the result is the same. What do I have to change so that scale-to-zero is actually disabled?

import mlflow.deployments

cl = mlflow.deployments.get_deploy_client("databricks")
cl.create_endpoint(
   name="llama3_1_8b_instruct",
   config={
       "served_entities": [
           {
               "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
               "entity_version": "2",
               "max_provisioned_throughput": 12000,
               "scale_to_zero_enabled": False,
           }
       ],
       "traffic_config": {
           "routes": [
               {
                   "served_model_name": "meta_llama_v3_1_8b_instruct-2",
                   "traffic_percentage": "100",
               }
           ]
       },
   },
)

Walter_C
Databricks Employee

Can you try with the following:

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

endpoint = client.create_endpoint(
    name="llama3_1_8b_instruct",
    config={
        "served_entities": [
            {
                "name": "llama3_1_8b_instruct-entity",
                "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
                "entity_version": "2",
                "workload_size": "Small",
                "scale_to_zero_enabled": False
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": "llama3_1_8b_instruct-entity",
                    "traffic_percentage": 100
                }
            ]
        }
    }
)

damselfly20
New Contributor III

Thanks for the reply, @Walter_C. That didn't quite work: the endpoint was deployed on CPU and max_provisioned_throughput was not taken into account. I finally got it working like this:

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

endpoint = client.create_endpoint(
    name="llama3_1_8b_instruct-test",
    config={
        "served_entities": [
            {
                "name": "llama3_1_8b_instruct-entity",
                "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
                "entity_version": "2",
                "scale_to_zero_enabled": "false",
                "min_provisioned_throughput": 12000,
                "max_provisioned_throughput": 12000
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": "llama3_1_8b_instruct-entity",
                    "traffic_percentage": 100
                }
            ]
        }
    }
)
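
For anyone reusing this, the working config above can be wrapped in a small helper so the served entity name and the route always match. This is only a sketch: build_pt_endpoint_config is a hypothetical name, not part of MLflow or Databricks, and it simply reproduces the values from the call above. The scale_to_zero_enabled value is kept as the string "false" exactly as posted; the thread does not say whether a boolean False behaves the same once min_provisioned_throughput is set.

```python
from typing import Any


def build_pt_endpoint_config(
    entity_name: str,
    entity_version: str,
    served_name: str,
    throughput: int,
) -> dict[str, Any]:
    """Build a provisioned-throughput config mirroring the working call above.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    return {
        "served_entities": [
            {
                "name": served_name,
                "entity_name": entity_name,
                "entity_version": entity_version,
                # Kept as the string "false", as in the working post.
                "scale_to_zero_enabled": "false",
                # Min and max are set to the same value, as in the working post.
                "min_provisioned_throughput": throughput,
                "max_provisioned_throughput": throughput,
            }
        ],
        "traffic_config": {
            "routes": [
                # The route refers to the served entity by its "name".
                {"served_model_name": served_name, "traffic_percentage": 100}
            ]
        },
    }


config = build_pt_endpoint_config(
    entity_name="system.ai.meta_llama_v3_1_8b_instruct",
    entity_version="2",
    served_name="llama3_1_8b_instruct-entity",
    throughput=12000,
)

# To deploy (needs a Databricks workspace, so it is left commented out):
# from mlflow.deployments import get_deploy_client
# client = get_deploy_client("databricks")
# client.create_endpoint(name="llama3_1_8b_instruct-test", config=config)
```

The only differences from my first attempt are the explicit "name" on the served entity, the route pointing at that name, and min_provisioned_throughput being set alongside max_provisioned_throughput.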