Databricks ships some killer toys for large-language-model work:
- `ai_query` for in-warehouse inference
- Vector Search for lightning-fast retrieval
- Serving Endpoints for real-time chat
Put them together, though, and you’ll trip over a few booby traps I learned about the hard way.
| # | The surprise | Why it hurts |
|---|---|---|
| 1 | A single `NULL` in `CONCAT` nukes the whole prompt | The LLM never even sees your question |
| 2 | `similarity_search()` only accepts one string | Batch jobs grind along row-by-row |
| 3 | Calling an endpoint in a loop feels like dial-up | Hundreds of prompts = coffee-break latency |
Here’s how I dodge each land-mine — code included, copy-paste away.
## 1 · Vaccinate Your Prompts Against NULL
SQL’s motto is: “If anything is `NULL`, everybody’s `NULL`.”
So instead of begging the LLM to ignore missing data, I scrub the prompt string first:
```sql
SELECT
  id,
  ai_query(
    'your-endpoint-name',
    CONCAT_WS(' ',
      'Answer from context:',
      COALESCE(context, 'No context.'),
      'Question:',
      COALESCE(question, 'No question.')
    ),
    modelParameters => named_struct('temperature', 0.3, 'max_tokens', 100)
  ) AS response
FROM questions_table;
```
`COALESCE` supplies a sensible default; `CONCAT_WS` quietly skips any leftover `NULL`s.
Result: every row ships a valid prompt.
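If you build prompts in PySpark instead of SQL, the same guard translates directly. Here’s a minimal sketch using `pyspark.sql.functions` (table and column names match the SQL above):

```python
from pyspark.sql import functions as F

# Same NULL-proofing, DataFrame edition: coalesce() fills the defaults,
# concat_ws() skips any NULLs that still slip through.
prompts = spark.table("questions_table").select(
    "id",
    F.concat_ws(
        " ",
        F.lit("Answer from context:"),
        F.coalesce(F.col("context"), F.lit("No context.")),
        F.lit("Question:"),
        F.coalesce(F.col("question"), F.lit("No question.")),
    ).alias("prompt"),
)
```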
## 2 · Faux-Batch Vector Search
The Vector Search SDK is single-query only. I trick it into “batch mode” with a thread pool:
```python
# Parallel similarity_search()
from concurrent.futures import ThreadPoolExecutor
import logging, time

from databricks.vector_search.client import VectorSearchClient

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("vector-search")

def get_index(endpoint, name):
    # Resolve the index once; every worker thread reuses the same object.
    vs_client = VectorSearchClient()
    return vs_client.get_index(endpoint_name=endpoint, index_name=name)

def search(index, query, cols, tries=3):
    # One query, with exponential back-off on transient failures.
    for n in range(tries):
        try:
            return index.similarity_search(query_text=query, columns=cols, num_results=5)
        except Exception as e:
            if n == tries - 1:
                return {"error": str(e)}
            logger.warning(f"Retry {n+1}: {e}")
            time.sleep(2 ** n)

def batch_search(queries, endpoint="my-endpoint", idx="my-index", workers=20):
    # Fan the queries out across a thread pool; results come back in input order.
    index = get_index(endpoint, idx)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futs = [pool.submit(search, index, q, ["id", "text", "metadata"]) for q in queries]
        return [f.result() for f in futs]
```
Twenty threads on the driver give me a 10–20× speed-up versus a plain for-loop, with back-off retries to smooth over momentary blips.
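Calling it is one line. The names below are placeholders, and I’m assuming the usual response payload with hits under `result.data_array`:

```python
# Hypothetical queries and endpoint/index names; swap in your own.
queries = ["How do I rotate credentials?", "What is the job cluster SLA?"]
results = batch_search(queries, endpoint="my-endpoint", idx="my-index")

for q, r in zip(queries, results):
    if "error" in r:
        print(f"{q!r} failed: {r['error']}")
    else:
        print(f"{q!r} -> {r['result']['row_count']} hits")
```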
## 3 · Fire-Hose Calls to an LLM Endpoint
Exact same threading trick, but wrapped around `WorkspaceClient` so I can send system + user prompts together:
```python
from concurrent.futures import ThreadPoolExecutor
import logging, time

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

class FastLLM:
    def __init__(self, endpoint, workers=10):
        self.endpoint = endpoint
        self.workers = workers
        # One WorkspaceClient shared by all threads: connection reuse
        # is a big part of the speed-up.
        self.wsc = WorkspaceClient()

    def _ask(self, sys_msg, user_msg, tries=3):
        # Single chat completion, with exponential back-off on failure.
        for n in range(tries):
            try:
                resp = self.wsc.serving_endpoints.query(
                    name=self.endpoint,
                    messages=[
                        ChatMessage(role=ChatMessageRole.SYSTEM, content=sys_msg),
                        ChatMessage(role=ChatMessageRole.USER, content=user_msg),
                    ],
                    max_tokens=200,
                    temperature=0.2,
                )
                return {"content": resp.choices[0].message.content, "error": None}
            except Exception as e:
                if n == tries - 1:
                    return {"content": None, "error": str(e)}
                log.warning(f"Retry {n+1}: {e}")
                time.sleep(2 ** n)

    def ask_many(self, prompts, sys_msg="You are a helpful assistant"):
        # Fire all prompts concurrently; results come back in input order.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            futs = [pool.submit(self._ask, sys_msg, p) for p in prompts]
            return [f.result() for f in futs]

# Demo
if __name__ == "__main__":
    engine = FastLLM("my-endpoint", workers=10)
    answers = engine.ask_many([
        "What's the capital of France?",
        "Explain machine learning in one sentence.",
        "Write a haiku about mountains.",
    ])
    for a in answers:
        print(a["content"] or a["error"])
```
## TL;DR
- Sanitize prompts in SQL, not in the model.
- Threads beat async in a Databricks notebook for I/O-heavy jobs.
- Reuse connections and sprinkle in exponential back-off; half the “random” failures vanish (see the sketch below).
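That last point generalizes beyond Databricks. Here’s the same back-off pattern as a tiny standalone decorator (an illustrative helper, not part of any SDK):

```python
import functools, random, time

def with_backoff(tries=3, base=1.0):
    # Retry with exponential back-off plus jitter; illustrative only.
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for n in range(tries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if n == tries - 1:
                        raise
                    time.sleep(base * 2 ** n + random.random())
        return inner
    return wrap

# Usage: decorate any flaky network call, e.g. @with_backoff(tries=4).
```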
Steal these snippets, remix them, and let me know what other hurdles you run into. Always happy to swap tips — just tag me on LinkedIn.
Happy building!