
Issue in creating endpoint for quantized gguf model with llama-cpp-python

lucaperes
New Contributor

Hello, Databricks Community,

I am experiencing an issue while trying to serve a quantized model in gguf format using Databricks serving with the llama-cpp-python library.

The model is registered in Unity Catalog using MLflow's pyfunc flavor. It loads without any issues via mlflow.pyfunc.load_model, which indicates that the registration and initial configuration are correct.

The problem arises during the creation of the inference endpoint. Although the model is registered and loaded, I am unable to create the endpoint necessary to perform predictions. The following logs are produced:

 

[4cmcn] [2025-02-03 15:02:53 +0000] [12033] [INFO] Booting worker with pid: 12033
[4cmcn] [2025-02-03 15:02:55 +0000] [9] [ERROR] Worker (pid:12018) was sent code 132!
[4cmcn] [2025-02-03 15:02:55 +0000] [12045] [INFO] Booting worker with pid: 12045
[4cmcn] [2025-02-03 15:02:55 +0000] [9] [ERROR] Worker (pid:12027) was sent code 132!
[4cmcn] [2025-02-03 15:02:55 +0000] [12049] [INFO] Booting worker with pid: 12049
[4cmcn] [2025-02-03 15:02:56 +0000] [9] [ERROR] Worker (pid:12030) was sent code 132!
[4cmcn] [2025-02-03 15:02:56 +0000] [12062] [INFO] Booting worker with pid: 12062
[4cmcn] [2025-02-03 15:02:56 +0000] [9] [ERROR] Worker (pid:12033) was sent code 132!

 

Code:

 

%pip install tkmacosx>=1.0.5
%pip install pynput>=1.7.7
%pip install llama-cpp-python>=0.3.6
%pip install pyperclip>=1.9.0
%pip install transformers>=4.46.2
%pip install pygments>=2.19.1
%pip install cloudpickle>=3.1.1
%pip install mlflow>=2.20.1

from mlflow.models.signature import infer_signature
import mlflow
from typing import Generator, List, Dict, Any, Union, Tuple
from llama_cpp import Llama
from collections import deque
import os
from pathlib import Path

class ChatModelWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self):
        self.model_path = None
        self.model = None
    
    def load_context(self, context):
        self.model_path = "/Volumes/ml_lab/generativo/models/granite-3.1-3b-a800m-instruct-Q6_K.gguf"
        self.model = Llama(
            self.model_path,
            n_ctx=8192,
            verbose=False,
            n_threads=8
         )
        
    def create_chat_completion(
                self,
                messages: List[Dict[str, str]], 
                temperature: float = 0.4,
                top_p: float = 0.9,
                top_k: int = 50,
                repeat_penalty: float = 1.2,
                max_tokens: int = 256
    ) -> Generator[str, None, None]:
        """Helper method to create chat completions with standard parameters"""
        if self.model is None:
            raise ValueError("The Llama model must be provided as an argument.")

        output = ""
        for chunk in self.model.create_chat_completion(
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            repeat_penalty=repeat_penalty,
            stream=True
        ):

            content = chunk['choices'][0]['delta'].get('content')
            if content:
                if content in ["<end_action>", "<|endoftext|>"]:
                    break
                output += content
                yield output

    def process(self, messages):
        """Process the messages and generate the response."""
        response = ""
        for chunk in self.create_chat_completion(messages, max_tokens=2024):
            response = chunk
            yield chunk

    def get_answer(self, messages):
        """Return the final answer of the conversation."""
        try:
            return deque(self.process(messages), maxlen=1).pop()
        except IndexError:
            return ""
        
    def predict(self, context, model_input: List[Dict[str,str]]) -> Dict[str,str]:
        """Generate answers for multiple inputs."""
        return {"answer": self.get_answer(model_input)}

# Loading the Llama model outside the prediction function
# avoids reloading it on every prediction and saves resources
messages = [{'role': 'system',
             'content': 'Você é um assistente que fala português e responde perguntas do usuário baseado no conteúdo fornecido.'},
            {'role': 'user', 'content': 'Oi'}]
signature = infer_signature(messages, {"answer": "Olá, tudo bem?"})


mlflow.set_registry_uri("databricks-uc")
with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        python_model=ChatModelWrapper(),
        artifact_path="model",
        registered_model_name="ml_lab.generativo.granite",
        signature=signature,
        pip_requirements=["tkmacosx>=1.0.5",
                          "pynput>=1.7.7",
                          "llama-cpp-python>=0.3.6",
                          "pyperclip>=1.9.0",
                          "transformers>=4.46.2",
                          "pygments>=2.19.1"],
    )

model = mlflow.pyfunc.load_model(model_info.model_uri)
print(model.predict(messages))

 

I would like to know if there are any specific guidelines or known adjustments that could resolve this issue. Any help in diagnosing and resolving this matter would be greatly appreciated.

1 REPLY

mark_ott
Databricks Employee

Exit code 132 means the worker process hit an illegal instruction (SIGILL), which is most often caused by a CPU incompatibility with the code being executed, especially with libraries that use SIMD or hardware acceleration (e.g., llama-cpp-python, which is often compiled for AVX/AVX2 instruction sets). This is a common issue when deploying quantized LLMs with llama-cpp-python in containerized or cloud environments like Databricks. Below are the most relevant guidelines and adjustments for your scenario.
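
By the usual POSIX convention, an exit status above 128 is 128 plus the terminating signal number, so 132 corresponds to signal 4 (SIGILL). A quick way to confirm that mapping from any Python session:

import signal

# Exit status 132 = 128 + 4: the worker was killed by SIGILL
# (illegal instruction), matching the gunicorn log above.
print(signal.Signals(132 - 128))  # Signals.SIGILL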

Key Diagnostics and Guidelines

  • CPU Compatibility: llama-cpp-python builds are optimized for specific CPU features (like AVX2, AVX, or SSE). If your Databricks cluster runs on machines that do not support the required instruction set (for example, ARM or older x86 processors), workers will crash with code 132.

    • Check the hardware spec of your serving environment. You might need to recompile llama-cpp-python with the appropriate flags for the CPUs in your Databricks cluster.

    • You can recompile as follows:

    CMAKE_ARGS='-DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_FMA=off' pip install llama-cpp-python --force-reinstall --no-cache-dir
    • If using ARM, additional compilation steps are necessary; refer to the official llama-cpp-python documentation.

  • Serving Mode:

    • Ensure your endpoint setup uses a worker environment that matches your model registration environment. Sometimes, Databricks clusters for endpoints have different machine types from the ones you used for development or registration.

    • Try registering and serving the model on a cluster with identical worker node types, or use a custom Docker image for serving.

  • Resource Constraints: Large context sizes (e.g., n_ctx=8192) and high thread counts (n_threads=8) can cause memory or resource exhaustion, potentially crashing the workers. Try a smaller n_ctx value or fewer threads (n_threads=1 or 2) during initial troubleshooting.

  • Endpoint Logs: Exit code 132 appears in the serving container logs when the worker process crashes. Download the full log file and look for earlier entries that mention failed imports or "Illegal instruction" to confirm a hardware issue.

  • Model Location: Storage paths like /Volumes/ml_lab/generativo/models/granite-3.1-3b-a800m-instruct-Q6_K.gguf must be reachable from every endpoint worker. If the serving environment does not mount this volume, the workers will fail when loading the model; packaging the .gguf file with the model as an MLflow artifact removes that dependency entirely (see the sketch after this list).
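
A minimal sketch of that artifact-based approach, assuming the wrapper class and Volume path from the question; the reduced n_ctx and n_threads values are illustrative debugging settings rather than tuned recommendations:

import mlflow
from llama_cpp import Llama

class ChatModelWrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # The .gguf file is resolved from the artifacts packaged with the
        # model, so the endpoint no longer depends on a mounted /Volumes path.
        self.model = Llama(
            context.artifacts["gguf"],
            n_ctx=2048,     # smaller context while debugging
            n_threads=2,    # fewer threads while debugging
            verbose=False,
        )

    def predict(self, context, model_input):
        ...  # create_chat_completion / process / get_answer as in the question

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        python_model=ChatModelWrapper(),
        artifact_path="model",
        registered_model_name="ml_lab.generativo.granite",
        # log_model copies this file into the model's artifact directory,
        # so it is deployed together with the model.
        artifacts={"gguf": "/Volumes/ml_lab/generativo/models/granite-3.1-3b-a800m-instruct-Q6_K.gguf"},
        pip_requirements=["llama-cpp-python>=0.3.6", "mlflow"],
    )

If the worker still exits with code 132 after this change, the crash is happening before the file is even read, which points back to the instruction-set mismatch described above.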

Recommended Next Steps

  • Check CPU Info: Run cat /proc/cpuinfo (Linux) in the serving environment to verify instruction set support (see the sketch after this list).

  • Rebuild llama-cpp-python: Use flags specific to your environment; avoid AVX2 if it is not supported.

  • Use Compatible Cluster/Endpoint Spec: Ensure the hardware matches or create a custom environment.

  • Reduce Model Params: Temporarily lower thread/context for debugging.

  • Log Inspection: Look for 'Illegal Instruction' or import failures in the full log.
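
A minimal sketch of that CPU check, assuming an x86 Linux host (ARM machines expose a "Features" line instead of "flags" in /proc/cpuinfo):

# Print which SIMD instruction sets this host's CPU advertises.
with open("/proc/cpuinfo") as f:
    flags_line = next(line for line in f if line.startswith("flags"))

flags = set(flags_line.split(":", 1)[1].split())
for isa in ("sse4_2", "avx", "avx2", "avx512f", "fma"):
    print(f"{isa}: {'present' if isa in flags else 'MISSING'}")

If avx2 (or whichever instruction set your llama-cpp-python wheel was built for) is missing on the serving nodes but present on your development cluster, the rebuild step above is the fix.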

Alternative Approaches

  • Consider using Databricks GPU serving endpoints if available; they avoid many CPU-specific issues and typically run containers that support a broader range of hardware.

  • If you must serve on CPU, restrict model configuration to match the lowest common denominator of your cluster's hardware.

Troubleshooting Table

Suggested Action                   Why?
Rebuild llama-cpp-python           Match CPU instruction set
Lower n_ctx and n_threads          Avoid resource exhaustion
Confirm model path availability    Prevent file access failures
Inspect full endpoint logs         Find root error cause

If these adjustments do not resolve the issue, share your full serving endpoint Docker image details and cluster hardware configuration, as that will help isolate the problem further.