Error at model serving for quantised models using bitsandbytes library

phi_alpaca
New Contributor III

Hello,

I've been trying to serve registered MLflow models at a GPU Model Serving endpoint, which works except for the models that use the bitsandbytes library. The library is used to quantise LLMs to 4-bit/8-bit (e.g. Mistral-7B); however, registering the model at the endpoint fails. This error is shown in the service log:

[Attached screenshot of the service log error: phi_alpaca_1-1708013174746.png]

All the libraries needed are listed in the requirements.txt file. It looks like one option to fix the error is to run a bash script that helps it locate the right package path, but we're not able to do that on a serving endpoint.
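For context, the kind of 4-bit load in question looks roughly like this (a simplified sketch assuming Mistral-7B with NF4 quantisation; not our exact code):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Sketch only: load a model in 4-bit with bitsandbytes through transformers.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",        # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")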

Has anyone successfully served a quantised LLM model at Databricks model serving using bitsandbytes? If so, how do you get around it? Any help on the topic would be much appreciated.

Thanks

 

9 REPLIES

Kaniz
Community Manager

Hi @phi_alpaca, serving quantized LLMs (Large Language Models) with Databricks Model Serving can be a powerful way to optimize performance and reduce latency.

Let's explore some options:

  1. Databricks Model Serving with GPU and LLM Optimization:

  2. Quantized LLMs with bitsandbytes:

  3. Troubleshooting the Error:

    • If you're encountering an error while registering the model at the serving endpoint, consider the following steps:
      • Check Dependencies: Ensure that all required libraries (including bitsandbytes) are correctly listed in the requirements.txt file (see the sketch after this list).
      • Path Issues: If the error is related to locating the package path, check whether the serving environment has the necessary paths set up. Unfortunately, running a bash script directly at the serving endpoint might not be feasible.
      • Model Registration: Double-check the model registration process. Sometimes issues arise during model logging or deployment.
      • Logs and Debugging: Review the service logs for more specific error messages. Debugging logs can provide insights into the root cause.
      • Community Support: Reach out to the Databricks community or forums. Others who have encountered similar issues might have valuable insights or workarounds.
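For example, a minimal sketch of logging a transformers pipeline with the quantisation libraries declared explicitly (the pipeline variable and package list here are illustrative, not a verified configuration):

import mlflow

# Hypothetical sketch: "pipe" is a transformers pipeline built earlier with a
# bitsandbytes-quantised model; declaring the libraries ensures the serving
# container installs them.
mlflow.transformers.log_model(
    transformers_model=pipe,
    artifact_path="model",
    pip_requirements=[
        "torch",
        "transformers",
        "accelerate",
        "bitsandbytes",
    ],
)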

 

 

G-M
Contributor

Hi @phi_alpaca, we are facing exactly the same issue trying to serve a bitsandbytes-quantized version of Mixtral-8x7B. Did you make any progress resolving this? The answer from @Kaniz isn't too helpful and seems to be AI-generated...

As you say, the deployed container is such a black box that we can't take the diagnostic steps listed in the error output.

phi_alpaca
New Contributor III

Hey @G-M, thanks for sharing your experience as well. Unfortunately I haven't had any luck resolving this on my end. I'd be interested to know if you have any breakthrough down the line. Is this something Databricks might be able to put a small fix in for, please? @Kaniz
Thanks

JAgreenskylake
New Contributor II

Hi @phi_alpaca, have you managed to solve this? We have a similar issue.

phi_alpaca
New Contributor III

Hey @JAgreenskylake, no luck so far. I have been working around it by not using quantised models, which is not ideal, so I really hope this becomes possible soon.

G-M
Contributor

@phi_alpaca

We have solved it by providing a conda_env.yaml when we log the model; all we needed was to add cudatoolkit=11.8 to the dependencies.
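For reference, this is roughly what it looks like when logging the model. Only the cudatoolkit pin is the actual fix; the other entries and names here are placeholders:

import mlflow

# Sketch of a conda environment that adds cudatoolkit=11.8 so the serving
# container can resolve the CUDA libraries bitsandbytes needs at runtime.
conda_env = {
    "name": "bnb-serving-env",          # placeholder name
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.10",                  # placeholder Python version
        "cudatoolkit=11.8",             # the key addition
        "pip",
        {"pip": ["torch", "transformers", "accelerate", "bitsandbytes"]},
    ],
}

# Passing the dict to log_model is equivalent to pointing at a conda_env.yaml file.
mlflow.transformers.log_model(
    transformers_model=pipe,            # the quantised pipeline being logged (placeholder)
    artifact_path="model",
    conda_env=conda_env,
)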

phi_alpaca
New Contributor III

Thanks so much for sharing, and glad it worked out for you guys!
I will have a go and report back.

phi_alpaca
New Contributor III

I seem to have some compatibility issues with cudatoolkit=11.8. Would it be possible for you to share what versions you use for torch, transformers, accelerate, and bitsandbytes? Thanks!

G-M
Contributor

These versions are working for us:

torch==1.13.1
transformers==4.35.2
accelerate==0.25.0
bitsandbytes==0.41.3
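Combined with the cudatoolkit pin mentioned above, the dependencies section of the conda environment would look roughly like this (the layout is a sketch; only the pins themselves come from our setup):

# Sketch only: the pinned versions slotted into the conda_env from the earlier post.
conda_env = {
    "name": "bnb-serving-env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.10",                  # placeholder
        "cudatoolkit=11.8",
        "pip",
        {"pip": [
            "torch==1.13.1",
            "transformers==4.35.2",
            "accelerate==0.25.0",
            "bitsandbytes==0.41.3",
        ]},
    ],
}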
