An AI-powered company enrichment agent is a critical tool for sales, marketing, and research teams. Whether you're building prospect lists, conducting market research, or analyzing competitors, you need accurate, up-to-date information about companies: their headquarters, funding history, team size, investors, and founders.
Manual research is time-consuming and doesn't scale. What if you could automate this process using AI agents that intelligently search the web, extract relevant information, and structure it in a database, all while running in a production-grade data platform?
In this post, I'll show you how to build a company enrichment agent using:
- **Databricks** for model serving, Unity Catalog governance, and Delta Lake storage
- **LangChain** for agent orchestration and structured output
- **Nimble's Real-Time Search API** for live web search and extraction
- **Claude Sonnet 4.5** for reasoning
The result is a powerful, automated system that can enrich hundreds of companies with minimal manual effort - combining real-time web intelligence with production-grade governance and observability.
Sales, marketing, and research teams constantly need to enrich company data.
For example, you want to gather information like:
- Headquarters address
- Total funding raised
- Employee count or range
- Investors
- Founders
For a list of even 100 companies, doing this research by hand is painful.
This isn't a one-time task. Company data changes constantly - new funding rounds, leadership transitions, headcount growth. What took days to compile becomes stale in weeks, forcing you to repeat the entire process.
Our approach uses an AI agent with a two-step strategy powered by Nimble's Real-Time Search API: a fast search first, then targeted extraction only when information is still missing.
Traditional index-based search APIs may return stale, cached results that are days or weeks old. For data enrichment, that staleness is a critical problem: funding rounds close and headcounts change faster than an index refreshes.
Nimble takes a fundamentally different approach: real-time web browsing. Instead of querying a static index, Nimble spins up headless browsers that navigate live websites and extract fresh data on the fly. That's an important distinction if you're comparing approaches commonly used in RAG systems.
While Nimble provides the real-time external context from the web, Databricks provides the governed intelligence and observability:
- **Model Flexibility:** By using Databricks Model Serving (via `ChatDatabricks`), you're not locked into any specific provider. Use models from Anthropic, OpenAI, Google, Meta, and more with a unified interface, billing, guardrails, and enterprise security. Swap models without rewriting code.
- **MLflow Observability:** Track and debug agent execution with MLflow—see which tools were called, compare different models or retrieval approaches, and quickly diagnose issues as you iterate on your enrichment pipeline.
- **Unity Catalog Governance:** Enriched data is automatically governed under Unity Catalog with full lineage tracking, ready for downstream use in dashboards, AI/BI Genie queries, or other data workflows.
This combination makes the difference between a prototype and a production system that scales reliably across your organization.
The agent searches with `deep_search=false`. Nimble's browsers navigate live sites (Crunchbase, LinkedIn, company websites) and return structured JSON data, not HTML to parse. Your agent receives `{"address": "123 Main St", "funding": "$50M"}` with data accurate as of *right now*.
For deeper data needs, Nimble's site-specialized data extraction kicks in. A Crunchbase-trained agent understands exactly where funding data lives. A LinkedIn agent knows company page structure. This site-awareness delivers enterprise-grade accuracy that generic crawlers can't match.
This strategy balances speed (fast search) with completeness (targeted extraction), while real-time browsing ensures data freshness that indexed search vendors simply cannot provide.
First, we create a Delta table to store our companies and their enrichment data:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
from delta.tables import DeltaTable

TABLE_NAME = "users.ilanc.company_enrichment_demo"

spark = SparkSession.builder.getOrCreate()

# Sample companies to enrich
companies_data = [
    ("Anthropic", "anthropic.com"),
    ("OpenAI", "openai.com"),
    ("Databricks", "databricks.com"),
    ("Nimble", "nimbleway.com"),
]

# Create a DataFrame with empty enrichment columns
df = spark.createDataFrame(companies_data, ["company_name", "website"])
df = (
    df.withColumn("address", lit(None).cast(StringType()))
    .withColumn("funding", lit(None).cast(StringType()))
    .withColumn("employees", lit(None).cast(StringType()))
    .withColumn("investors", lit(None).cast(StringType()))
    .withColumn("founders", lit(None).cast(StringType()))
    .withColumn("enrichment_status", lit("pending"))
)

# Save as a Delta table
df.write.format("delta").mode("overwrite").saveAsTable(TABLE_NAME)
```
Next, we set up a LangChain agent with Claude Sonnet 4.5 and Nimble's search and extraction tools (you'll need a Nimble API key):
```python
import os
import getpass
from typing import List

from pydantic import BaseModel, Field
from databricks_langchain import ChatDatabricks
from langchain.agents import create_agent
from langchain_nimble import NimbleExtractTool, NimbleSearchTool

# Set up the Nimble API key
if not os.environ.get("NIMBLE_API_KEY"):
    os.environ["NIMBLE_API_KEY"] = getpass.getpass("NIMBLE_API_KEY:\n")

# Here you can switch between models from Anthropic, OpenAI, Google, Meta...
llm_model = ChatDatabricks(endpoint="databricks-claude-sonnet-4-5")

# Define the agent's strategy
prompt_template = """
You are a company enrichment agent. Use this two-step approach:

**Step 1: Fast Search**
Use search_tool with deep_search=false to get quick snippets and URLs.
Extract as much information as possible from the snippets.

**Step 2: Targeted Extraction (if needed)**
If information is missing, use extract_tool on 1-2 relevant URLs from the search results.
Focus on official company websites, LinkedIn, or Crunchbase.

**Required Information:**
- address: Full headquarters address
- funding: Total funding (e.g., "$100M Series B")
- employees: Count or range (e.g., "500-1000")
- investors: List of investor names
- founders: List of founder names

Return "Not found" for missing strings, empty list [] for missing arrays.
"""

# Define the structured output schema
class CompanyInfo(BaseModel):
    """Company enrichment information"""
    address: str = Field(description="Company headquarters address")
    funding: str = Field(description="Total funding raised")
    employees: str = Field(description="Employee count or range")
    investors: List[str] = Field(description="List of investors")
    founders: List[str] = Field(description="List of founders")

# Create the agent
agent = create_agent(
    model=llm_model,
    tools=[NimbleSearchTool(), NimbleExtractTool()],
    system_prompt=prompt_template,
    response_format=CompanyInfo,
)
```
The enrichment function calls the agent and returns structured data:
```python
import json

async def enrich_company(company_name: str, website: str) -> dict:
    """Use the agent to enrich company data with structured output."""
    try:
        query = (
            f"Find address, funding, employees, investors, and founders "
            f"for {company_name} (website: {website})"
        )

        # Stream agent execution; keep only the final step
        step = None
        async for step in agent.astream(
            {"messages": [{"role": "user", "content": query}]},
            stream_mode="values",
        ):
            pass  # Intermediate steps could be logged or inspected here

        # Extract the structured response from the final step
        structured = step["structured_response"]
        result = structured.model_dump()

        # Convert lists to JSON strings for Delta table storage
        return {
            "address": result.get("address", "Not found"),
            "funding": result.get("funding", "Not found"),
            "employees": result.get("employees", "Not found"),
            "investors": json.dumps(result.get("investors", [])),
            "founders": json.dumps(result.get("founders", [])),
        }
    except Exception as e:
        print(f"❌ Error enriching {company_name}: {e}")
        return {
            "address": "Error",
            "funding": "Error",
            "employees": "Error",
            "investors": "[]",
            "founders": "[]",
        }
```
Finally, we process all pending companies and update the Delta table:
```python
from pyspark.sql.functions import col, lit

delta_table = DeltaTable.forName(spark, TABLE_NAME)
pending = (
    spark.table(TABLE_NAME)
    .filter(col("enrichment_status") == "pending")
    .collect()
)
print(f"🚀 Enriching {len(pending)} companies...\n")

success_count = 0
error_count = 0

for idx, row in enumerate(pending, 1):
    company = row.company_name
    website = row.website
    print(f"[{idx}/{len(pending)}] Processing {company}...")
    try:
        # Enrich with the agent
        data = await enrich_company(company, website)

        # Update the Delta table in place
        delta_table.update(
            condition=col("company_name") == company,
            set={
                "address": lit(data["address"]),
                "funding": lit(data["funding"]),
                "employees": lit(data["employees"]),
                "investors": lit(data["investors"]),
                "founders": lit(data["founders"]),
                "enrichment_status": lit(
                    "completed" if data["address"] != "Error" else "failed"
                ),
            },
        )
        if data["address"] != "Error":
            print(f"  ✅ {data['address']}\n")
            success_count += 1
        else:
            error_count += 1
    except Exception as e:
        print(f"  ❌ Unexpected error: {e}\n")
        error_count += 1

print(f"🎉 Complete! Success: {success_count}, Failed: {error_count}")
```
The basic implementation works well for small datasets, but what if you need to enrich 10,000 companies?
Here are two critical improvements for production-scale workloads.
The Problem: Our current implementation processes companies sequentially, one at a time. For 1,000 companies, this could take hours.
The Solution: Use Spark's `pandas_udf` to distribute enrichment across your cluster. Wrap your agent logic in a UDF that processes rows in parallel across multiple nodes.
Key optimization: Initialize the agent once per partition (not per row) to avoid overhead.
Results: 10-100x speedup depending on cluster size. Databricks autoscaling clusters dynamically adjust resources based on workload.
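As a rough sketch, the per-partition pattern can look like the following. It uses the iterator-of-Series pandas UDF style; `make_agent` and `enrich_one` are hypothetical stand-ins for the real agent construction and invocation, and the Spark registration line is shown in a comment since the function body itself is plain pandas:

```python
from typing import Iterator
import pandas as pd

def make_agent() -> dict:
    # Hypothetical placeholder for expensive agent construction
    # (model client, tools, prompt). Built once per partition below.
    return {"name": "enrichment-agent"}

def enrich_one(agent: dict, company: str) -> str:
    # Hypothetical stand-in for a real agent invocation.
    return f"enriched:{company}"

def enrich_partition(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Key optimization: one agent init per partition, not per row
    agent = make_agent()
    for batch in batches:
        yield batch.apply(lambda name: enrich_one(agent, name))

# On Databricks you would register and apply it roughly like this:
# enrich_udf = F.pandas_udf(enrich_partition, "string")
# df = df.withColumn("enriched", enrich_udf(F.col("company_name")))
```

The iterator signature is what lets the expensive setup run once per partition while each incoming batch reuses the same agent.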
For complex enrichment needs, a single agent handling all tasks faces context engineering challenges. When one agent must extract addresses, funding, investors, founders, and executives, its prompt becomes bloated with instructions for every domain, leading to lower accuracy and higher hallucination rates.
Split enrichment into domain-expert agents (for example, a funding agent and a people/leadership agent), each with laser-focused prompts and specialized knowledge.
Each agent knows the authoritative sources for its domain. The funding agent doesn't waste tokens searching LinkedIn profiles; the people agent skips Crunchbase tables. This source awareness prevents agents from searching irrelevant sources and improves accuracy by focusing on where reliable data actually lives.
A supervisor agent coordinates the specialized agents, ensuring consistency (handling alternative company names), detecting conflicts between agents, and assigning confidence scores based on cross-agent validation.
Anthropic's research on multi-agent systems shows 90%+ performance improvements when using specialized agents coordinated by a supervisor, compared to single generalist agents. However, multi-agent systems consume 10-15x more tokens due to parallel execution and require more sophisticated error handling.
When to Use Multi-Agent? Best for complex, high-value enrichment where accuracy justifies the cost. For simpler use cases like the 5-field enrichment in this tutorial, the single-agent approach is sufficient and cost-effective.
Cost Optimization with Nimble: Use `deep_search=false` for initial searches—this is significantly cheaper and faster (typically 80% cost reduction) while still providing rich snippets. Only use `deep_search=true` or targeted extraction when you need full page content. This two-tier approach minimizes API costs while maximizing data completeness.
Incremental Enrichment: Add `last_updated` timestamps and only re-enrich stale data (>30 days old) to reduce costs.
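A minimal staleness check along these lines could drive the re-enrichment filter (`is_stale` is an illustrative helper name, not part of any library used above):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, max_age_days: int = 30, now=None) -> bool:
    """Rows never enriched, or enriched more than max_age_days ago, need a refresh."""
    now = now or datetime.now(timezone.utc)
    return last_updated is None or (now - last_updated) > timedelta(days=max_age_days)
```

In Spark the same rule maps to a column filter, e.g. `col("last_updated").isNull() | (col("last_updated") < date_sub(current_date(), 30))`.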
Smart Caching: Store search results in Delta tables to avoid redundant API calls for the same queries.
Confidence Scoring: Route low-confidence results (<0.8) to a human review queue for validation.
Error Handling: Implement retry logic with exponential backoff for transient failures, and use async/await patterns for I/O-bound operations to maximize throughput. This sits alongside broader production concerns like orchestration and monitoring of your data pipelines.
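As a sketch, an async retry wrapper with exponential backoff and jitter might look like this (`with_retries` is a hypothetical helper, not part of any library mentioned above):

```python
import asyncio
import random

async def with_retries(coro_fn, attempts: int = 4, base_delay: float = 1.0):
    """Call an async function, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await coro_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            # Exponential backoff with jitter: ~base, ~2x base, ~4x base, ...
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            await asyncio.sleep(delay)

# Usage: result = await with_retries(lambda: enrich_company(name, site))
```

Wrapping the agent call this way keeps transient Nimble or model-serving hiccups from marking a company as permanently failed.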
Building an AI company enrichment agent is surprisingly straightforward with the right tools. By combining Databricks' data platform, LangChain's agent framework, Nimble's Real-Time Search API, and Claude Sonnet's reasoning, you can automate hours of manual research.
The two-step search strategy (fast search → targeted extraction) is particularly powerful because Nimble handles the hardest part of web data collection. With real-time web browsing, proprietary JavaScript rendering, site-specialized agents, and structured JSON output, Nimble delivers parse-ready data that your AI agents can immediately reason over. No HTML parsing, no browser management, no site-specific scrapers to maintain.
Structured output with Pydantic ensures data quality, while Delta tables provide production-grade storage with ACID guarantees. The result: a scalable, reliable enrichment pipeline that processes thousands of companies without manual intervention.
You can adapt this pattern for other enrichment tasks: contact enrichment, product research, competitive analysis, or market intelligence gathering, anywhere you need AI agents to autonomously gather web data at scale.
Ready to try it yourself? Check out the full notebook on GitHub and start enriching your own company data with Nimble's Search API.
---
Have questions or want to share your enrichment use case? Drop a comment below!