Debu-Sinha
Databricks Employee

Databricks ships some killer toys for large-language-model work:

  • ai_query for in-warehouse inference

  • Vector Search for lightning-fast retrieval

  • Serving Endpoints for real-time chat

Put them together, though, and you’ll trip over a few booby traps I learned about the hard way.

 

  #  The surprise                                         Why it hurts
  1  A single NULL in CONCAT nukes the whole prompt       The LLM never even sees your question
  2  similarity_search() only accepts one string          Batch jobs grind along row-by-row
  3  Calling an endpoint in a loop feels like dial-up     Hundreds of prompts = coffee-break latency

Here’s how I dodge each land-mine — code included, copy-paste away.

1 · Vaccinate Your Prompts Against NULL

SQL’s motto is: “If anything is NULL, everybody’s NULL.”
So instead of begging the LLM to ignore missing data, I scrub the prompt string first:

SELECT
  id,
  ai_query(
    'your-endpoint-name',
    CONCAT_WS(' ',
      'Answer from context:',
      COALESCE(context, 'No context.'),
      'Question:',
      COALESCE(question, 'No question.')
    ),
    modelParameters => named_struct('temperature', 0.3, 'max_tokens', 100)
  ) AS response
FROM questions_table;

COALESCE supplies a sensible default; CONCAT_WS quietly skips any leftover NULLs.

Result: every row ships a valid prompt.
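
If you want to see the difference for yourself, here is a quick standalone check (illustrative literals only, no table required):

-- Plain CONCAT propagates NULL: one missing piece and the whole prompt vanishes
SELECT CONCAT('Answer from context: ', NULL, ' Question: ', 'What is Delta Lake?') AS prompt;
-- returns NULL

-- CONCAT_WS plus COALESCE keeps the prompt intact
SELECT CONCAT_WS(' ',
  'Answer from context:', COALESCE(NULL, 'No context.'),
  'Question:', 'What is Delta Lake?') AS prompt;
-- returns 'Answer from context: No context. Question: What is Delta Lake?'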

2 · Faux-Batch Vector Search

The Vector Search SDK is single-query only. I trick it into “batch mode” with a thread pool:

 

# Parallel similarity_search()
from databricks.vector_search.client import VectorSearchClient
from concurrent.futures import ThreadPoolExecutor
import logging, time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("vector-search")

def get_index(endpoint, name):
    # One client and one index handle, shared by every worker thread
    vs_client = VectorSearchClient()
    return vs_client.get_index(endpoint_name=endpoint, index_name=name)

def search(index, query, cols, tries=3):
    # Exponential back-off smooths over transient throttling and network blips
    for n in range(tries):
        try:
            return index.similarity_search(query_text=query, columns=cols, num_results=5)
        except Exception as e:
            if n == tries - 1:
                return {"error": str(e)}
            logger.warning(f"Retry {n+1}: {e}")
            time.sleep(2 ** n)

def batch_search(queries, endpoint="my-endpoint", idx="my-index", workers=20):
    index = get_index(endpoint, idx)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futs = [pool.submit(search, index, q, ["id", "text", "metadata"]) for q in queries]
    # Leaving the with-block waits for every future, so all results are ready here
    return [f.result() for f in futs]

Twenty threads on the driver give me a 10–20× speed-up versus a plain for-loop, with back-off retries to smooth over momentary blips.
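
For completeness, here is roughly how a notebook cell would call it (endpoint and index names are placeholders, and the exact response shape can vary a little between SDK versions):

queries = [
    "How do I create a Delta table?",
    "What is Unity Catalog?",
    "How does ai_query handle NULL inputs?",
]
results = batch_search(queries, endpoint="my-endpoint", idx="my-index")

for query, res in zip(queries, results):
    if "error" in res:
        print(f"{query!r} failed: {res['error']}")
    else:
        # similarity_search typically returns matching rows under result.data_array
        print(f"{query!r} -> {len(res['result']['data_array'])} hits")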

3 · Fire-Hose Calls to an LLM Endpoint

Exact same threading trick, but wrapped around WorkspaceClient so I can send system + user prompts together:

 

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole
from concurrent.futures import ThreadPoolExecutor
import logging, time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

class FastLLM:
    def __init__(self, endpoint, workers=10):
        self.endpoint = endpoint
        self.workers = workers
        self.wsc = WorkspaceClient()

    def _ask(self, sys_msg, user_msg, tries=3):
        for n in range(tries):
            try:
                resp = self.wsc.serving_endpoints.query(
                    name=self.endpoint,
                    messages=[
                        ChatMessage(role=ChatMessageRole.SYSTEM, content=sys_msg),
                        ChatMessage(role=ChatMessageRole.USER, content=user_msg)
                    ],
                    max_tokens=200,
                    temperature=0.2
                )
                return {"content": resp.choices[0].message.content, "error": None}
            except Exception as e:
                if n == tries - 1:
                    return {"content": None, "error": str(e)}
                log.warning(f"Retry {n+1}: {e}")
                time.sleep(2 ** n)

    def ask_many(self, prompts, sys_msg="You are a helpful assistant"):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            futs = [pool.submit(self._ask, sys_msg, p) for p in prompts]
        return [f.result() for f in futs]

# Demo
if __name__ == "__main__":
    engine = FastLLM("my-endpoint", workers=10)
    answers = engine.ask_many([
        "What’s the capital of France?",
        "Explain machine learning in one sentence.",
        "Write a haiku about mountains."
    ])
    for a in answers:
        print(a["content"] or a["error"])

Ten threads is my comfort zone: quick yet gentle enough to dodge rate-limits. Scale up or chunk the inputs once you see how your endpoint behaves.
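
If the endpoint does start pushing back, the simplest throttle is to feed ask_many smaller chunks with a short pause in between. A minimal sketch (chunk size and pause are arbitrary; tune them to your rate limits):

import time

# Work through prompts in smaller batches; the pause between chunks keeps the
# sustained request rate below the endpoint's limits
def ask_in_chunks(engine, prompts, chunk_size=50, pause_secs=1.0):
    answers = []
    for i in range(0, len(prompts), chunk_size):
        answers.extend(engine.ask_many(prompts[i:i + chunk_size]))
        if i + chunk_size < len(prompts):
            time.sleep(pause_secs)
    return answers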

TL;DR

  • Sanitize prompts in SQL, not in the model.

  • Threads beat async in a Databricks notebook for I/O-heavy jobs.

  • Reuse connections and sprinkle in exponential back-off; half the “random” failures vanish.

Steal these snippets, remix them, and let me know what other hurdles you run into. Always happy to swap tips; just tag me on LinkedIn.

Happy building!
