In today's data-driven enterprise landscape, organizations are looking to harness the power of Large Language Models (LLMs) to extract insights from their vast amounts of structured and unstructured data stored in various Software as a Service (SaaS) applications. However, integrating enterprise data with LLMs often involves complex ETL pipelines, specialized infrastructure, and significant engineering effort. Databricks Lakeflow Connect and AI Functions for batch inference provide a streamlined approach to this challenge, enabling teams to quickly build production-ready pipelines that transform raw SaaS data into actionable insights.
Organizations often struggle to connect their valuable SaaS application data to modern LLM-powered analytics due to several challenges, summarized in the table below.
Databricks addresses these challenges with a unified approach combining Lakeflow Connect for data ingestion and AI Functions for batch LLM inference. This integration enables data teams to build end-to-end data pipelines with minimal overhead.
Lakeflow Connect offers fully-managed connectors that enable you to easily ingest data from SaaS applications and databases into your Databricks Lakehouse. The entire ingestion pipeline is governed by Unity Catalog and powered by serverless compute and Lakeflow Pipelines.
This solution brings together essential components to automate and simplify data ingestion from diverse SaaS sources.
Lakeflow Connect supports a wide range of data sources, from enterprise applications such as Salesforce and Workday to databases such as SQL Server.*
*Managed SaaS and database connectors provided by Lakeflow Connect are in various release states. The release states and range of supported sources will continue to evolve over time.
Lakeflow Connect uses efficient incremental read and write operations to accelerate data ingestion, improve scalability, reduce costs, and maintain data freshness for downstream use.
| Challenges of Integrating SaaS Data with LLMs | Lakeflow Connect Solution |
| --- | --- |
| Data from SaaS applications exists in proprietary formats and requires custom connectors. | Provides fully managed, ready-to-use connectors for popular SaaS sources and databases, eliminating the need for custom ETL. |
| Building and maintaining ingestion pipelines is resource-intensive. | Automates ingestion pipelines with serverless Lakeflow Pipelines, reducing engineering effort and operational overhead. |
| Processing large volumes of text data through LLMs requires specialized infrastructure. | Integrates with Databricks AI Functions for scalable batch LLM inference, with no need to provision or manage custom infrastructure. |
| Ensuring security, governance, and compliance across the pipeline adds complexity. | Leverages Unity Catalog for fine-grained access control, credential management, and end-to-end data governance. |
Databricks AI Functions provide a simplified way to apply LLMs to your data at scale. For batch processing, the ai_query function is particularly useful as it allows you to process large volumes of data through an LLM in a scalable and efficient manner.
The ai_query function can be used in SQL or Python to apply LLM processing to your data. It supports Databricks-hosted foundation models as well as external and custom model serving endpoints.
When you use a Databricks-hosted foundation model for batch inference, Databricks configures a provisioned throughput endpoint that scales automatically based on the workload.
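To make this concrete, here is a minimal sketch of ai_query applied to a column, using the Databricks-hosted Llama endpoint referenced later in this post. The table and column names (`demo.reviews`, `review_text`) are illustrative:

```python
# Minimal batch-inference sketch: apply an LLM to every row of a table.
# The endpoint is a Databricks-hosted foundation model; the table and
# column names are illustrative placeholders.
df = spark.sql("""
    SELECT
        review_text,
        ai_query(
            'databricks-meta-llama-3-1-8b-instruct',
            CONCAT('Summarize this review in one sentence: ', review_text)
        ) AS summary
    FROM demo.reviews
""")
df.show(truncate=False)
```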
Before you can use ai_query for batch inference, make sure your workspace supports Foundation Model APIs (or another model serving endpoint you can query) and that you have the necessary Unity Catalog permissions on the data you plan to process.
Building an end-to-end LLM pipeline for employee feedback
Now, let's look at an example of a complete pipeline that ingests employee feedback data from Workday, preprocesses the text, applies batch LLM inference with ai_query, and analyzes the structured results.
First, we'll create a connection to our Workday instance. This connection securely stores the authentication credentials needed to access the Workday API.
Before diving into the code, let's understand what a Workday connection represents in the Lakeflow Connect architecture.
A Workday connection in Databricks is a Unity Catalog securable object that securely stores and manages the authentication credentials needed to access your Workday instance. This connection object serves as the foundation for all data pipelines that will extract data from Workday reports.
When setting up a Workday connection, keep a few important aspects in mind: the credentials you supply must have access to the target reports, and the connection itself is governed by Unity Catalog, so you can grant or restrict its use like any other securable object.
For Workday specifically, the connection enables access to custom reports containing valuable HR data like employee feedback, performance reviews, and organizational metrics, making it ideal for our employee engagement analysis use case.
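Here is a sketch of creating the connection from a notebook. We assume a `workday_raas` connection type and credentials stored in a secret scope named `hr_scope`; adapt both to your environment:

```python
# Create a Unity Catalog connection that securely stores Workday credentials.
# The connection type (workday_raas) and the secret scope/key names are
# assumptions -- adjust them to match your workspace setup.
username = dbutils.secrets.get(scope="hr_scope", key="workday_user")
password = dbutils.secrets.get(scope="hr_scope", key="workday_password")

spark.sql(f"""
    CREATE CONNECTION IF NOT EXISTS workday_conn
    TYPE workday_raas
    OPTIONS (
        username '{username}',
        password '{password}'
    )
""")
```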
After creating the connection object, we need to define a Lakeflow Connect pipeline that will use this connection to extract data from specific Workday reports. The pipeline configuration specifies the connection to use, the source report URLs to ingest, and the destination catalog, schema, and tables in Unity Catalog.
This declarative configuration approach greatly simplifies what would otherwise require custom ETL code to handle API pagination, error handling, and incremental data loading.
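Below is a sketch of the pipeline definition using the Databricks Python SDK (databricks-sdk). The report URL and destination names are placeholders, and field names may vary slightly across SDK versions:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

w = WorkspaceClient()

# Define a managed ingestion pipeline that pulls a Workday custom report
# through the connection created above. Destinations are placeholders.
created = w.pipelines.create(
    name="workday_feedback_ingest",
    ingestion_definition=pipelines.IngestionPipelineDefinition(
        connection_name="workday_conn",
        objects=[
            pipelines.IngestionConfig(
                report=pipelines.ReportSpec(
                    source_url="<your-workday-report-url>",
                    destination_catalog="main",
                    destination_schema="hr",
                    destination_table="employee_feedback",
                )
            )
        ],
    ),
    serverless=True,
)

# Trigger the first run; subsequent runs read incrementally.
w.pipelines.start_update(pipeline_id=created.pipeline_id)
```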
The resulting ingestion pipeline is fully managed by Databricks: it runs on serverless compute, automatically scales based on data volume, and integrates with Lakeflow Pipelines for reliable data delivery.
Once our data is ingested, we create a view that preprocesses the employee feedback data and extracts the relevant text fields. We build a feedback_text column that combines all of the unstructured fields into a single text field, which makes it easier to run batch inference over the data.
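A sketch of the preprocessing view follows. The source column names (feedback_q1, feedback_q2, manager_comments) are illustrative stand-ins for whatever free-text fields your Workday report contains:

```python
# Combine the unstructured fields into a single feedback_text column.
# Source column names are illustrative; replace with your report's fields.
spark.sql("""
    CREATE OR REPLACE VIEW main.hr.feedback_prepared AS
    SELECT
        employee_id,
        department,
        tenure_group,
        CONCAT_WS(' ',
            COALESCE(feedback_q1, ''),
            COALESCE(feedback_q2, ''),
            COALESCE(manager_comments, '')
        ) AS feedback_text
    FROM main.hr.employee_feedback
""")
```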
In this step, we use the ai_query function to analyze employee feedback data at scale. Below, we break down the key components of the ai_query call to clarify the choices and structure:
We use the endpoint 'databricks-meta-llama-3-1-8b-instruct', which is a Databricks-hosted foundation model optimized for instruction-following tasks. This model is chosen for its strong performance on enterprise text analysis and its compatibility with structured output prompts. Using a Databricks-hosted model also means you don't need to manage your own model serving infrastructure: the endpoint is provisioned and scaled automatically by Databricks.
The prompt is carefully crafted to guide the LLM to extract actionable insights from each feedback record. It provides explicit instructions and a list of expected outputs:
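Putting these pieces together, here is a sketch of the batch inference step. The prompt wording is illustrative: it constrains the model to a strict JSON object so the output can be parsed downstream, and table names follow the placeholders used above:

```python
# Run batch inference over every feedback record and persist the raw
# LLM responses. The prompt constrains the model to a JSON object so
# the next step can parse it reliably.
spark.sql("""
    CREATE OR REPLACE TABLE main.hr.feedback_analyzed AS
    SELECT
        employee_id,
        department,
        tenure_group,
        ai_query(
            'databricks-meta-llama-3-1-8b-instruct',
            CONCAT(
                'Analyze the following employee feedback and respond with ',
                'ONLY a JSON object containing two fields: "sentiment" ',
                '(one of Positive, Neutral, Negative) and ',
                '"engagement_drivers" (an array of up to three short labels, ',
                'e.g. "Compensation", "Career Growth"). Feedback: ',
                feedback_text
            )
        ) AS llm_response
    FROM main.hr.feedback_prepared
""")
```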
Finally, we parse the JSON output from the LLM to create structured fields we can analyze:
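A sketch of the parsing step, assuming the JSON shape requested in the prompt above:

```python
# Extract structured fields from the model's JSON response.
spark.sql("""
    CREATE OR REPLACE VIEW main.hr.feedback_structured AS
    SELECT
        department,
        tenure_group,
        get_json_object(llm_response, '$.sentiment') AS sentiment,
        from_json(
            get_json_object(llm_response, '$.engagement_drivers'),
            'ARRAY<STRING>'
        ) AS drivers
    FROM main.hr.feedback_analyzed
""")
```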
Now we can perform analytics on the structured data:
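For example, the two queries below produce the sentiment and engagement-driver summaries shown in the results tables (a sketch against the view defined above):

```python
# Sentiment distribution across departments.
spark.sql("""
    SELECT department, sentiment, COUNT(*) AS response_count
    FROM main.hr.feedback_structured
    GROUP BY department, sentiment
    ORDER BY department, response_count DESC
""").show()

# Top engagement drivers by tenure group (one row per mentioned driver).
spark.sql("""
    SELECT tenure_group, driver, COUNT(*) AS mentions
    FROM main.hr.feedback_structured
    LATERAL VIEW explode(drivers) exploded AS driver
    GROUP BY tenure_group, driver
    ORDER BY tenure_group, mentions DESC
""").show()
```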
Our queries yield the following results:
Sentiment Distribution Across Departments
| Department | Sentiment | Response Count |
| --- | --- | --- |
| Sales | Positive | 120 |
| Sales | Neutral | 30 |
| Sales | Negative | 10 |
| Engineering | Positive | 150 |
| Engineering | Negative | 20 |
| HR | Positive | 80 |
| HR | Neutral | 25 |
| HR | Negative | 5 |
Top Engagement Drivers by Tenure Group
| Tenure Group | Driver | Mentions |
| --- | --- | --- |
| Less than 1 year | Work-Life Balance | 50 |
| Less than 1 year | Management | 40 |
| 1-3 years | Compensation | 70 |
| 1-3 years | Career Growth | 60 |
| 4-7 years | Company Culture | 45 |
| 4-7 years | Team Dynamics | 35 |
| 8+ years | Work Environment | 55 |
| 8+ years | Job Satisfaction | 50 |
We’ve successfully leveraged our Workday data to create actionable insights for HR and leadership teams by highlighting both the sentiment landscape and key engagement drivers segmented by department and tenure group.
Combining Lakeflow Connect with Batch LLM Inference offers a powerful, scalable solution for applying LLMs to SaaS application data. This integrated approach enables organizations to ingest SaaS data without building custom connectors, apply LLMs at scale without managing inference infrastructure, and govern the entire workflow end to end with Unity Catalog.
By unifying data ingestion and LLM processing, this solution overcomes key enterprise challenges, delivering a streamlined, maintainable architecture for large-scale AI workloads.
Ready to get started with Lakeflow Connect and Batch LLM Inference? Explore the Databricks documentation for Lakeflow Connect and AI Functions, and try building your first pipeline against one of your own SaaS sources.