An AI-powered company enrichment agent is a critical tool for sales, marketing, and research teams. Whether you're building prospect lists, conducting market research, or analyzing competitors, you need accurate, up-to-date information about companies: their headquarters, funding history, team size, investors, and founders.
Manual research is time-consuming and doesn't scale. What if you could automate this process using AI agents that intelligently search the web, extract relevant information, and structure it in a database, all while running in a production-grade data platform?
In this post, I'll show you how to build a company enrichment agent using:
- **Databricks** for model serving, Unity Catalog governance, and Delta Lake storage
- **LangChain** for agent orchestration and structured output
- **Nimble's Real-Time Search API** for live web search and extraction
- **Claude Sonnet 4.5** for reasoning
The result is a powerful, automated system that can enrich hundreds of companies with minimal manual effort - combining real-time web intelligence with production-grade governance and observability.
Sales, marketing, and research teams constantly need to enrich company data.
For example, you want to gather information like:
- Headquarters address
- Total funding raised
- Employee count or range
- Investors
- Founders
For a list of even 100 companies, doing this research by hand is painful.
This isn't a one-time task. Company data changes constantly - new funding rounds, leadership transitions, headcount growth. What took days to compile becomes stale in weeks, forcing you to repeat the entire process.
Our approach uses an AI agent with a two-step strategy powered by Nimble's Real-Time Search API: a fast search first, then targeted extraction only when information is still missing.
Traditional index-based search APIs may return stale, cached results that are days or weeks old. For data enrichment, that staleness is a critical problem: funding rounds close and headcounts change faster than an index refreshes.
Nimble takes a fundamentally different approach: real-time web browsing. Instead of querying a static index, Nimble spins up headless browsers that navigate live websites and extract fresh data on the fly. That's an important distinction if you're comparing approaches commonly used in RAG systems.
While Nimble provides the real-time external context from the web, Databricks provides the governed intelligence and observability:
- **Model Flexibility:** By using Databricks Model Serving (via `ChatDatabricks`), you're not locked into any specific provider. Use models from Anthropic, OpenAI, Google, Meta, and more with a unified interface, billing, guardrails, and enterprise security. Swap models without rewriting code.
- **MLflow Observability:** Track and debug agent execution with MLflow—see which tools were called, compare different models or retrieval approaches, and quickly diagnose issues as you iterate on your enrichment pipeline.
- **Unity Catalog Governance:** Enriched data is automatically governed under Unity Catalog with full lineage tracking, ready for downstream use in dashboards, AI/BI Genie queries, or other data workflows.
This combination makes the difference between a prototype and a production system that scales reliably across your organization.
The agent searches with `deep_search=false`. Nimble's browsers navigate live sites (Crunchbase, LinkedIn, company websites) and return structured JSON data, not HTML to parse. Your agent receives `{"address": "123 Main St", "funding": "$50M"}` with data accurate as of *right now*.
For deeper data needs, Nimble's site-specialized data extraction kicks in. A Crunchbase-trained agent understands exactly where funding data lives. A LinkedIn agent knows company page structure. This site-awareness delivers enterprise-grade accuracy that generic crawlers can't match.
This strategy balances speed (fast search) with completeness (targeted extraction), while real-time browsing ensures data freshness that indexed search vendors simply cannot provide.
First, we create a Delta table to store our companies and their enrichment data:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
from delta.tables import DeltaTable

TABLE_NAME = "users.ilanc.company_enrichment_demo"

spark = SparkSession.builder.getOrCreate()

# Sample companies to enrich
companies_data = [
    ("Anthropic", "anthropic.com"),
    ("OpenAI", "openai.com"),
    ("Databricks", "databricks.com"),
    ("Nimble", "nimbleway.com"),
]

# Create a DataFrame with empty enrichment columns
df = spark.createDataFrame(companies_data, ["company_name", "website"])
df = (
    df.withColumn("address", lit(None).cast(StringType()))
    .withColumn("funding", lit(None).cast(StringType()))
    .withColumn("employees", lit(None).cast(StringType()))
    .withColumn("investors", lit(None).cast(StringType()))
    .withColumn("founders", lit(None).cast(StringType()))
    .withColumn("enrichment_status", lit("pending"))
)

# Save as a Delta table
df.write.format("delta").mode("overwrite").saveAsTable(TABLE_NAME)
```
Next, we set up a LangChain agent with Claude Sonnet 4.5 and Nimble's search and extraction tools (you'll need a Nimble API key):
```python
import os
import getpass
from typing import List

from pydantic import BaseModel, Field
from databricks_langchain import ChatDatabricks
from langchain.agents import create_agent
from langchain_nimble import NimbleExtractTool, NimbleSearchTool

# Set up the Nimble API key
if not os.environ.get("NIMBLE_API_KEY"):
    os.environ["NIMBLE_API_KEY"] = getpass.getpass("NIMBLE_API_KEY:\n")

# Here you can switch between models from Anthropic, OpenAI, Google, Meta...
llm_model = ChatDatabricks(endpoint="databricks-claude-sonnet-4-5")

# Define the agent's strategy
prompt_template = """
You are a company enrichment agent. Use this two-step approach:

**Step 1: Fast Search**
Use search_tool with deep_search=false to get quick snippets and URLs.
Extract as much information as possible from the snippets.

**Step 2: Targeted Extraction (if needed)**
If information is missing, use extract_tool on 1-2 relevant URLs from the search results.
Focus on official company websites, LinkedIn, or Crunchbase.

**Required Information:**
- address: Full headquarters address
- funding: Total funding (e.g., "$100M Series B")
- employees: Count or range (e.g., "500-1000")
- investors: List of investor names
- founders: List of founder names

Return "Not found" for missing strings, empty list [] for missing arrays.
"""

# Define the structured output schema
class CompanyInfo(BaseModel):
    """Company enrichment information"""
    address: str = Field(description="Company headquarters address")
    funding: str = Field(description="Total funding raised")
    employees: str = Field(description="Employee count or range")
    investors: List[str] = Field(description="List of investors")
    founders: List[str] = Field(description="List of founders")

# Create the agent
agent = create_agent(
    model=llm_model,
    tools=[NimbleSearchTool(), NimbleExtractTool()],
    system_prompt=prompt_template,
    response_format=CompanyInfo,
)
```
The enrichment function calls the agent and returns structured data:
```python
import json

async def enrich_company(company_name: str, website: str) -> dict:
    """Use the agent to enrich company data with structured output."""
    try:
        query = (
            f"Find address, funding, employees, investors, and founders "
            f"for {company_name} (website: {website})"
        )

        # Stream agent execution; keep only the final step
        step = None
        async for step in agent.astream(
            {"messages": [{"role": "user", "content": query}]},
            stream_mode="values",
        ):
            pass  # Intermediate steps could be logged or inspected here

        # Extract the structured response from the final step
        structured = step["structured_response"]
        result = structured.model_dump()

        # Convert lists to JSON strings for Delta table storage
        return {
            "address": result.get("address", "Not found"),
            "funding": result.get("funding", "Not found"),
            "employees": result.get("employees", "Not found"),
            "investors": json.dumps(result.get("investors", [])),
            "founders": json.dumps(result.get("founders", [])),
        }
    except Exception as e:
        print(f"❌ Error enriching {company_name}: {e}")
        return {
            "address": "Error",
            "funding": "Error",
            "employees": "Error",
            "investors": "[]",
            "founders": "[]",
        }
```
Finally, we process all pending companies and update the Delta table:
```python
from pyspark.sql.functions import col, lit

delta_table = DeltaTable.forName(spark, TABLE_NAME)
pending = (
    spark.table(TABLE_NAME)
    .filter(col("enrichment_status") == "pending")
    .collect()
)
print(f"🚀 Enriching {len(pending)} companies...\n")

success_count = 0
error_count = 0

for idx, row in enumerate(pending, 1):
    company = row.company_name
    website = row.website
    print(f"[{idx}/{len(pending)}] Processing {company}...")
    try:
        # Enrich with the agent
        data = await enrich_company(company, website)

        # Update the Delta table in place
        delta_table.update(
            condition=col("company_name") == company,
            set={
                "address": lit(data["address"]),
                "funding": lit(data["funding"]),
                "employees": lit(data["employees"]),
                "investors": lit(data["investors"]),
                "founders": lit(data["founders"]),
                "enrichment_status": lit(
                    "completed" if data["address"] != "Error" else "failed"
                ),
            },
        )
        if data["address"] != "Error":
            print(f"  ✅ {data['address']}\n")
            success_count += 1
        else:
            error_count += 1
    except Exception as e:
        print(f"  ❌ Unexpected error: {e}\n")
        error_count += 1

print(f"🎉 Complete! Success: {success_count}, Failed: {error_count}")
```
The basic implementation works well for small datasets, but what if you need to enrich 10,000 companies?
Here are two critical improvements for production-scale workloads.
The Problem: Our current implementation processes companies sequentially, one at a time. For 1,000 companies, this could take hours.
The Solution: Use Spark's `pandas_udf` to distribute enrichment across your cluster. Wrap your agent logic in a UDF that processes rows in parallel across multiple nodes.
Key optimization: Initialize the agent once per partition (not per row) to avoid overhead.
Results: 10-100x speedup depending on cluster size. Databricks autoscaling clusters dynamically adjust resources based on workload.
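As a rough sketch, the per-partition pattern can look like the following. It uses the iterator-of-Series pandas UDF style; `make_agent` and `enrich_one` are hypothetical stand-ins for the real agent construction and invocation, and the Spark registration line is shown in a comment since the function body itself is plain pandas:

```python
from typing import Iterator
import pandas as pd

def make_agent() -> dict:
    # Hypothetical placeholder for expensive agent construction
    # (model client, tools, prompt). Built once per partition below.
    return {"name": "enrichment-agent"}

def enrich_one(agent: dict, company: str) -> str:
    # Hypothetical stand-in for a real agent invocation.
    return f"enriched:{company}"

def enrich_partition(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Key optimization: one agent init per partition, not per row
    agent = make_agent()
    for batch in batches:
        yield batch.apply(lambda name: enrich_one(agent, name))

# On Databricks you would register and apply it roughly like this:
# enrich_udf = F.pandas_udf(enrich_partition, "string")
# df = df.withColumn("enriched", enrich_udf(F.col("company_name")))
```

The iterator signature is what lets the expensive setup run once per partition while each incoming batch reuses the same agent.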
For complex enrichment needs, a single agent handling all tasks faces context engineering challenges. When one agent must extract addresses, funding, investors, founders, and executives, its prompt becomes bloated with instructions for every domain, leading to lower accuracy and higher hallucination rates.
Split enrichment into domain-expert agents (for example, a funding agent and a people/leadership agent), each with laser-focused prompts and specialized knowledge.
Each agent knows the authoritative sources for its domain. The funding agent doesn't waste tokens searching LinkedIn profiles; the people agent skips Crunchbase tables. This source awareness prevents agents from searching irrelevant sources and improves accuracy by focusing on where reliable data actually lives.
A supervisor agent coordinates the specialized agents, ensuring consistency (handling alternative company names), detecting conflicts between agents, and assigning confidence scores based on cross-agent validation.
Anthropic's research on multi-agent systems shows 90%+ performance improvements when using specialized agents coordinated by a supervisor, compared to single generalist agents. However, multi-agent systems consume 10-15x more tokens due to parallel execution and require more sophisticated error handling.
When to Use Multi-Agent? Best for complex, high-value enrichment where accuracy justifies the cost. For simpler use cases like the 5-field enrichment in this tutorial, the single-agent approach is sufficient and cost-effective.
Cost Optimization with Nimble: Use `deep_search=false` for initial searches—this is significantly cheaper and faster (typically 80% cost reduction) while still providing rich snippets. Only use `deep_search=true` or targeted extraction when you need full page content. This two-tier approach minimizes API costs while maximizing data completeness.
Incremental Enrichment: Add `last_updated` timestamps and only re-enrich stale data (>30 days old) to reduce costs.
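A minimal staleness check along these lines could drive the re-enrichment filter (`is_stale` is an illustrative helper name, not part of any library used above):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, max_age_days: int = 30, now=None) -> bool:
    """Rows never enriched, or enriched more than max_age_days ago, need a refresh."""
    now = now or datetime.now(timezone.utc)
    return last_updated is None or (now - last_updated) > timedelta(days=max_age_days)
```

In Spark the same rule maps to a column filter, e.g. `col("last_updated").isNull() | (col("last_updated") < date_sub(current_date(), 30))`.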
Smart Caching: Store search results in Delta tables to avoid redundant API calls for the same queries.
Confidence Scoring: Route low-confidence results (<0.8) to a human review queue for validation.
Error Handling: Implement retry logic with exponential backoff for transient failures, and use async/await patterns for I/O-bound operations to maximize throughput. This sits alongside broader production concerns like orchestration and monitoring of your data pipelines.
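As a sketch, an async retry wrapper with exponential backoff and jitter might look like this (`with_retries` is a hypothetical helper, not part of any library mentioned above):

```python
import asyncio
import random

async def with_retries(coro_fn, attempts: int = 4, base_delay: float = 1.0):
    """Call an async function, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await coro_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            # Exponential backoff with jitter: ~base, ~2x base, ~4x base, ...
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            await asyncio.sleep(delay)

# Usage: result = await with_retries(lambda: enrich_company(name, site))
```

Wrapping the agent call this way keeps transient Nimble or model-serving hiccups from marking a company as permanently failed.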
Building an AI company enrichment agent is surprisingly straightforward with the right tools. By combining Databricks' data platform, LangChain's agent framework, Nimble's Real-Time Search API, and Claude Sonnet's reasoning, you can automate hours of manual research.
The two-step search strategy (fast search → targeted extraction) is particularly powerful because Nimble handles the hardest part of web data collection. With real-time web browsing, proprietary JavaScript rendering, site-specialized agents, and structured JSON output, Nimble delivers parse-ready data that your AI agents can immediately reason over. No HTML parsing, no browser management, no site-specific scrapers to maintain.
Structured output with Pydantic ensures data quality, while Delta tables provide production-grade storage with ACID guarantees. The result: a scalable, reliable enrichment pipeline that processes thousands of companies without manual intervention.
You can adapt this pattern for other enrichment tasks: contact enrichment, product research, competitive analysis, or market intelligence gathering, anywhere you need AI agents to autonomously gather web data at scale.
Ready to try it yourself? Check out the full notebook on GitHub and start enriching your own company data with Nimble's Search API.
---
Have questions or want to share your enrichment use case? Drop a comment below!