Regarding your first question about reducing the scale-down time of Databricks serverless model serving: currently, the system scales down to zero only after 30 minutes of inactivity. That window keeps instances warm so the endpoint can absorb sudden increases in traffic, while scaling to zero afterward keeps costs down for non-24/7 traffic or development environments. Unfortunately, there is no direct way to shorten this window, as it is part of the auto-scaling algorithm Databricks uses.
For your second question about deploying multiple models under a single endpoint: yes, Databricks supports this. You can serve multiple models from a single CPU serving endpoint with Databricks Model Serving, and an endpoint can serve any Python MLflow model registered in the Model Registry. When you create the endpoint, you can set a traffic split between the served models. For example, one model (call it "current") can receive 90% of the endpoint traffic, while another (call it "challenger") receives the remaining 10%. You can also update the traffic split between served models later as needed (see the sketch after the link below).
https://docs.databricks.com/en/machine-learning/model-serving/serve-multiple-models-to-serving-endpo...
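As a rough illustration, here is a minimal sketch of creating such an endpoint through the Serving Endpoints REST API using Python's requests library. The workspace URL, token, endpoint name, model names, and versions are placeholders you would replace with your own, and the exact payload fields should be double-checked against the documentation linked above:

```python
import requests

# Placeholders -- replace with your workspace URL and a valid access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# One endpoint serving two versions of the same registered model,
# with a 90/10 traffic split and scale-to-zero enabled on both.
payload = {
    "name": "my-endpoint",  # hypothetical endpoint name
    "config": {
        "served_models": [
            {
                "name": "current",
                "model_name": "my_registered_model",  # hypothetical registered model
                "model_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            },
            {
                "name": "challenger",
                "model_name": "my_registered_model",
                "model_version": "2",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            },
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "current", "traffic_percentage": 90},
                {"served_model_name": "challenger", "traffic_percentage": 10},
            ]
        },
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```

To change the split later (for example, promoting "challenger" to a larger share), you would send an updated config with new traffic_percentage values to the endpoint's config update API rather than recreating the endpoint.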