krkumar
Databricks Employee

In today's data-driven enterprise landscape, organizations are looking to harness the power of Large Language Models (LLMs) to extract insights from their vast amounts of structured and unstructured data stored in various Software as a Service (SaaS) applications. However, integrating enterprise data with LLMs often involves complex ETL pipelines, specialized infrastructure, and significant engineering effort. Databricks Lakeflow Connect and AI Functions for batch inference provide a streamlined approach to this challenge, enabling teams to quickly build production-ready pipelines that transform raw SaaS data into actionable insights.

Why Integrating SaaS Data with LLMs is Challenging

Organizations often struggle with connecting their valuable SaaS application data to modern LLM-powered analytics due to several challenges:

  • Data from SaaS applications exists in proprietary formats and requires custom connectors
  • Building and maintaining ingestion pipelines is resource-intensive
  • Processing large volumes of text data through LLMs requires specialized infrastructure
  • Ensuring security, governance, and compliance across the pipeline adds complexity

Databricks addresses these challenges with a unified approach combining Lakeflow Connect for data ingestion and AI Functions for batch LLM inference. This integration enables data teams to build end-to-end data pipelines with minimal overhead.

What is Lakeflow Connect?

Lakeflow Connect offers fully managed connectors that enable you to easily ingest data from SaaS applications and databases into your Databricks Lakehouse. The entire ingestion pipeline is governed by Unity Catalog and powered by serverless compute and Lakeflow Pipelines.

Key components of Lakeflow Connect

This solution brings together essential components to automate and simplify data ingestion from diverse SaaS sources:

  • Connections: A Unity Catalog securable object that stores authentication details for the SaaS application or database.
  • Ingestion pipelines: Transform source data into Delta tables; each pipeline is modeled as a serverless Lakeflow pipeline.
  • For database sources:
    • Gateway: Extracts data from the source database while maintaining transaction integrity
    • Staging storage: A Unity Catalog volume where data is staged before being applied to Delta tables

Supported data sources

Lakeflow Connect supports a wide range of data sources:

  • Databases: PostgreSQL, SQL Server
  • Business applications: Salesforce, SharePoint, Workday, NetSuite, ServiceNow, Google Analytics
  • Unstructured data: PDFs and Excel files from sources like SharePoint

*Managed SaaS and database connectors provided by Lakeflow Connect are in various release states. The release states and range of supported sources will continue to evolve over time. 

Lakeflow Connect uses efficient incremental read and write operations to accelerate data ingestion, improve scalability, reduce costs, and maintain data freshness for downstream use.

| Challenge of Integrating SaaS Data with LLMs | Lakeflow Connect Solution |
| --- | --- |
| Data from SaaS applications exists in proprietary formats and requires custom connectors | Provides fully managed, ready-to-use connectors for popular SaaS sources and databases, eliminating the need for custom ETL. |
| Building and maintaining ingestion pipelines is resource-intensive | Automates ingestion pipelines with serverless Lakeflow Pipelines, reducing engineering effort and operational overhead. |
| Processing large volumes of text data through LLMs requires specialized infrastructure | Integrates with Databricks AI Functions for scalable batch LLM inference, with no need to provision or manage custom infrastructure. |
| Ensuring security, governance, and compliance across the pipeline adds complexity | Leverages Unity Catalog for fine-grained access control, credential management, and end-to-end data governance. |

What is Batch LLM Inference with AI Functions? 

Databricks AI Functions provide a simplified way to apply LLMs to your data at scale. For batch processing, the ai_query function is particularly useful as it allows you to process large volumes of data through an LLM in a scalable and efficient manner.

How ai_query works

The ai_query function can be used in SQL or Python to apply LLM processing to your data. It supports:

  • Databricks-hosted foundation models: Pre-provisioned LLMs that Databricks manages, such as Llama 4
  • Custom model serving endpoints: Your own fine-tuned models or models hosted elsewhere

When you use a Databricks-hosted foundation model for batch inference, Databricks configures a provisioned throughput endpoint that scales automatically based on the workload.

Before you can use ai_query for batch inference, make sure the following requirements are met:

  • You’re working in a region that supports Foundation Model APIs
  • You have query permissions on the Delta tables you plan to process
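
With these prerequisites in place, a minimal invocation looks like the following. This is a sketch against a hypothetical product_reviews table; the full employee-feedback example appears later in this post:

%sql
SELECT
  review_text,
  ai_query(
    'databricks-meta-llama-3-1-8b-instruct',
    CONCAT('Summarize this product review in one sentence: ', review_text)
  ) AS summary
FROM main.examples.product_reviews
LIMIT 10;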


Building an end-to-end LLM pipeline for employee feedback

Now, let's look at an example of a complete pipeline that:

  1. Ingests employee feedback data from Workday using Lakeflow Connect
  2. Preprocesses the data to extract relevant text fields
  3. Applies LLM inference using ai_query to analyze employee sentiment and identify engagement drivers
  4. Stores the results for downstream analysis and action planning

Step 1: Set up Lakeflow Connect for Workday

First, we'll create a connection to our Workday instance. This connection securely stores the authentication credentials needed to access the Workday API.

Before diving into the code, let's understand what a Workday connection represents in the Lakeflow Connect architecture. 

A Workday connection in Databricks is a Unity Catalog securable object that securely stores and manages the authentication credentials needed to access your Workday instance. This connection object serves as the foundation for all data pipelines that will extract data from Workday reports.

When setting up a Workday connection, consider these important aspects:

  1. Security: All credentials are encrypted and stored as Unity Catalog securable objects, giving you fine-grained access control over who can use these connections
  2. Permissions: You'll need CREATE CONNECTION privileges on the metastore to create new connections
  3. Networking: If your workspace uses serverless egress control, you'll need to allowlist the Workday hostnames in your network policies

For Workday specifically, the connection enables access to custom reports containing valuable HR data like employee feedback, performance reviews, and organizational metrics, making it ideal for our employee engagement analysis use case.

# This code would typically be executed in a Databricks notebook
from databricks.sdk import WorkspaceClient

# Initialize the workspace client (authenticates automatically inside a notebook)
w = WorkspaceClient()

# Create a connection to Workday.
# In production, read these credentials from a Databricks secret scope
# rather than hardcoding them in the notebook.
workday_conn = w.connections.create(
    name="workday_connection",
    connection_type="WORKDAY",
    properties={
        "username": "your_workday_username",
        "password": "your_workday_password",
        "tenant_url": "https://your-instance.workday.com"
    }
)
print(f"Connection created with ID: {workday_conn.id}")

After creating the connection object, we need to define a Lakeflow Connect pipeline that will use this connection to extract data from specific Workday reports. The pipeline configuration specifies:

  • Where extracted data will be staged (storage parameter)
  • Which reports to extract from Workday (source.objects)
  • How to map the data to destination tables in Unity Catalog (destination_catalog, destination_schema, etc.)
  • How to identify unique records (primary_keys)

This declarative configuration approach greatly simplifies what would otherwise require custom ETL code to handle API pagination, error handling, and incremental data loading.

from databricks.sdk.service import pipelines

# Define the pipeline configuration
pipeline_config = {
    "name": "workday_feedback_ingestion",
    # Unity Catalog volume where extracted data is staged
    "storage": "main.hr_data.staging",
    # Rest of the configuration...
}

# Create the pipeline
pipeline = w.pipelines.create(**pipeline_config)


The resulting ingestion pipeline is fully managed by Databricks: it runs on serverless compute, automatically scales based on data volume, and integrates with Lakeflow Pipelines for reliable data delivery.
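
Once created, the pipeline can also be triggered and monitored programmatically. Here is a minimal sketch using the Databricks SDK, assuming the pipeline object from the previous step; the polling interval and terminal-state handling are illustrative:

import time
from databricks.sdk.service.pipelines import PipelineState

# Trigger a run of the ingestion pipeline created above
update = w.pipelines.start_update(pipeline_id=pipeline.pipeline_id)
print(f"Started update: {update.update_id}")

# Poll until the pipeline reaches a terminal state
while True:
    status = w.pipelines.get(pipeline_id=pipeline.pipeline_id)
    print(f"Pipeline state: {status.state}")
    if status.state in (PipelineState.IDLE, PipelineState.FAILED):
        break
    time.sleep(30)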

Step 2: Preprocess the ingested data

Once the data is ingested, we create a view that preprocesses the employee feedback to extract the relevant text fields. The feedback_text column combines all of the unstructured fields into a single text value, which makes it easier to run batch inference over each record.

%sql
CREATE OR REPLACE VIEW main.hr_data.employee_feedback AS
SELECT
  ef.Survey_Response_ID,
  ef.Employee_ID,
  e.Department,
  e.Manager_ID,
  e.Location,
  e.Tenure_Years,
  ef.Survey_Date,
  ef.Overall_Satisfaction_Score,
  ef.Verbatim_Feedback,
  ef.What_Do_You_Like_Most,
  ef.What_Could_Be_Improved,
  ef.Career_Growth_Comments,
  CONCAT(
    'Overall Satisfaction: ', ef.Overall_Satisfaction_Score, '\n\n',
    'What do you like most: ', COALESCE(ef.What_Do_You_Like_Most, ''), '\n\n',
    'What could be improved: ', COALESCE(ef.What_Could_Be_Improved, ''), '\n\n',
    'Career growth comments: ', COALESCE(ef.Career_Growth_Comments, ''), '\n\n',
    'Additional feedback: ', COALESCE(ef.Verbatim_Feedback, '')
  ) AS feedback_text
FROM
  main.hr_data.employee_feedback_raw ef
JOIN
  main.hr_data.employee_data e ON ef.Employee_ID = e.Employee_ID
WHERE
  (ef.What_Do_You_Like_Most IS NOT NULL OR 
   ef.What_Could_Be_Improved IS NOT NULL OR 
   ef.Career_Growth_Comments IS NOT NULL OR 
   ef.Verbatim_Feedback IS NOT NULL);
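
Before applying the LLM, it can help to spot-check the combined text. A quick, illustrative preview (not part of the original pipeline):

%sql
-- Preview a few combined feedback records before running inference
SELECT Survey_Response_ID, feedback_text
FROM main.hr_data.employee_feedback
LIMIT 5;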


Step 3: Apply batch LLM inference

In this step, we use the ai_query function to analyze employee feedback data at scale. Below, we break down the key components of the ai_query call to clarify the choices and structure:

Model Selection

We use the endpoint 'databricks-meta-llama-3-1-8b-instruct', a Databricks-hosted foundation model optimized for instruction-following tasks. This model is chosen for its strong performance on enterprise text analysis and its compatibility with structured output prompts. Using a Databricks-hosted model also means you don't need to manage your own model serving infrastructure; the endpoint is provisioned and scaled automatically by Databricks.

Prompt Construction

The prompt is carefully crafted to guide the LLM to extract actionable insights from each feedback record. It provides explicit instructions and a list of expected outputs:

  • Context and Instructions:
    The prompt starts by instructing the model to analyze the employee feedback and extract:
    • Overall sentiment (Positive, Negative, Mixed, Neutral)
    • Primary engagement drivers (from a provided list)
    • Key themes
    • Recommended actions for management
  • Structured Output Schema:
    The responseFormat argument supplies a JSON schema specifying the required keys and structure. This helps the LLM return outputs that are easily parsed for downstream analytics.
  • Field to Analyze:
    The actual employee feedback text is appended at the end of the prompt, ensuring the model has all the necessary context for each row in the table. This concatenation is done using SQL's CONCAT() function.

%sql
-- Create a table to store the LLM results
CREATE TABLE IF NOT EXISTS main.hr_data.employee_insights
AS
SELECT
  Survey_Response_ID,
  Employee_ID,
  Department,
  Manager_ID,
  Location,
  Tenure_Years,
  Survey_Date,
  Overall_Satisfaction_Score,
  ai_query(
    'databricks-meta-llama-3-1-8b-instruct',
    CONCAT(
      'Analyze the following employee feedback and extract:
      1. The overall sentiment (Positive, Negative, Mixed, Neutral)
      2. The primary engagement drivers (choose from: Work-Life Balance, Management, Compensation, Career Growth, Company Culture, Team Dynamics, Work Environment, Job Satisfaction)
      3. Key themes from the feedback
      4. Recommended actions for management

      Employee feedback:
      ', feedback_text),
    responseFormat => '{
      "type": "json_schema",
      "json_schema": {
        "schema": {
          "type": "object",
          "properties": {
            "sentiment": {"type": "string"},
            "engagement_drivers": {"type": "array", "items": {"type": "string"}},
            "key_themes": {"type": "array", "items": {"type": "string"}},
            "recommended_actions": {"type": "array", "items": {"type": "string"}}
          }
        },
        "strict": true
      }
    }'
  ) AS insights
FROM
  main.hr_data.employee_feedback;
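
For tighter control over generation, ai_query also accepts optional model parameters via modelParameters (the supported set varies by model). The sketch below, reusing the same endpoint and view, caps output length and lowers the temperature for more deterministic classifications; the prompt and row limit are illustrative:

%sql
SELECT
  ai_query(
    'databricks-meta-llama-3-1-8b-instruct',
    CONCAT('Classify the sentiment of this feedback as Positive, Negative, Mixed, or Neutral: ', feedback_text),
    modelParameters => named_struct('max_tokens', 100, 'temperature', 0)
  ) AS sentiment_only
FROM main.hr_data.employee_feedback
LIMIT 20;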


Step 4: Parse and analyze the LLM results

Finally, we parse the JSON output from the LLM to create structured fields we can analyze:

%sql
CREATE OR REPLACE VIEW main.hr_data.engagement_analysis AS
SELECT
  Survey_Response_ID,
  Employee_ID,
  Department,
  Manager_ID,
  Location,
  Tenure_Years,
  Survey_Date,
  Overall_Satisfaction_Score,
  insights,
  
  -- Parse JSON fields from the LLM output
  get_json_object(insights, '$.sentiment') AS sentiment,
  from_json(get_json_object(insights, '$.engagement_drivers'), 'array<string>') AS engagement_drivers,
  from_json(get_json_object(insights, '$.key_themes'), 'array<string>') AS key_themes,
  from_json(get_json_object(insights, '$.recommended_actions'), 'array<string>') AS recommended_actions
FROM
  main.hr_data.employee_insights;
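
Because LLM outputs can occasionally fail to match the requested schema, it's worth checking for rows where extraction returned NULL. A lightweight sanity check:

%sql
-- Count responses where sentiment could not be extracted
SELECT COUNT(*) AS unparsed_rows
FROM main.hr_data.engagement_analysis
WHERE sentiment IS NULL;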

Now we can perform analytics on the structured data:

%sql
-- See the distribution of sentiment across departments
SELECT
  Department,
  sentiment,
  COUNT(*) AS response_count
FROM
  main.hr_data.engagement_analysis
GROUP BY
  Department, sentiment
ORDER BY
  Department, response_count DESC;

-- Analyze top engagement drivers by tenure
SELECT
  CASE
    WHEN Tenure_Years < 1 THEN 'Less than 1 year'
    WHEN Tenure_Years <= 3 THEN '1-3 years'
    WHEN Tenure_Years <= 7 THEN '4-7 years'
    ELSE '8+ years'
  END AS tenure_group,
  driver,
  COUNT(*) AS mentions
FROM
  main.hr_data.engagement_analysis
  LATERAL VIEW explode(engagement_drivers) drivers AS driver
GROUP BY
  tenure_group, driver
ORDER BY
  tenure_group, mentions DESC;


Our queries yield the following results: 

Sentiment Distribution Across Departments

| Department  | Sentiment | Response Count |
| ----------- | --------- | -------------- |
| Sales       | Positive  | 120            |
| Sales       | Neutral   | 30             |
| Sales       | Negative  | 10             |
| Engineering | Positive  | 150            |
| Engineering | Negative  | 20             |
| HR          | Positive  | 80             |
| HR          | Neutral   | 25             |
| HR          | Negative  | 5              |

Top Engagement Drivers by Tenure Group

| Tenure Group     | Driver            | Mentions |
| ---------------- | ----------------- | -------- |
| Less than 1 year | Work-Life Balance | 50       |
| Less than 1 year | Management        | 40       |
| 1-3 years        | Compensation      | 70       |
| 1-3 years        | Career Growth     | 60       |
| 4-7 years        | Company Culture   | 45       |
| 4-7 years        | Team Dynamics     | 35       |
| 8+ years         | Work Environment  | 55       |
| 8+ years         | Job Satisfaction  | 50       |

We’ve successfully leveraged our Workday data to create actionable insights for HR and leadership teams by highlighting both the sentiment landscape and key engagement drivers segmented by department and tenure group.

Conclusion

Combining Lakeflow Connect with Batch LLM Inference offers a powerful, scalable solution for applying LLMs to SaaS application data. This integrated approach enables organizations to:

  • Ingest data from a wide range of SaaS applications with minimal configuration
  • Process large volumes of text data using batch LLM inference
  • Extract structured insights that drive informed business decisions
  • Automate the entire pipeline for continuous, scheduled analysis

By unifying data ingestion and LLM processing, this solution overcomes key enterprise challenges, delivering a streamlined, maintainable architecture for large-scale AI workloads.
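
To automate the flow end to end, the inference step can be scheduled as a recurring job. Below is a minimal sketch using the Databricks SDK; the job name, notebook path, and cron expression are illustrative assumptions, and compute (serverless or a cluster) must be configured per your workspace:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Schedule the batch-inference notebook to run daily (hypothetical path)
job = w.jobs.create(
    name="employee_feedback_llm_job",
    tasks=[
        jobs.Task(
            task_key="batch_inference",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/pipelines/employee_feedback_inference"
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # every day at 06:00
        timezone_id="UTC",
    ),
)
print(f"Created job {job.job_id}")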

Ready to get started with Lakeflow Connect and Batch LLM Inference? Here are your next steps: