Large Language Models (LLMs), and GenAI more broadly, have taken center stage with their ability to summarize long documents, write code, articles, and poems, explain complex topics, and power now-ubiquitous chatbots. Data engineers may think LLMs have little relevance to their line of work, but this couldn't be further from the truth. GenAI is a game-changer for extracting structure from unstructured data during the ETL process. By leveraging their advanced natural language processing (NLP) capabilities, LLMs can automatically identify, classify, and extract relevant information from unstructured data.
Performing ETL (Extract, Transform, Load) on unstructured data is a daunting task that can leave even the most seasoned data engineers frustrated. Unlike structured data, which fits neatly into predefined schemas and formats, unstructured data is the Wild West. As a result, extracting relevant information from unstructured data requires a tremendous amount of manual effort, custom coding, and data wrangling, only to be faced with the risk of data quality issues, inconsistencies, and inaccuracies. As an example, consider this regular expression: ^(?!.*?[._-]{2,})[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$
It may take someone with advanced regex experience a few minutes to work out that this expression is attempting to match email addresses, and even more time to determine whether it does so accurately. These challenges and costs of trying to extract structure from chaos often mean the value of unstructured data is never realized.
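To see how subtle those bugs can be, here is a quick sanity check of that pattern in SQL. This is a sketch with made-up sample addresses; note that in a Spark SQL string literal the backslash must be doubled.

-- Test the email regex against a few hand-picked addresses.
SELECT email,
       email RLIKE '^(?!.*?[._-]{2,})[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}$' AS looks_valid
FROM VALUES
  ('jane.doe@example.com'),    -- matches as expected
  ('jane..doe@example.com'),   -- rejected by the negative lookahead (consecutive dots)
  ('info@example.technology')  -- plausibly valid, but rejected: {2,6} caps the TLD at 6 letters
AS t(email);

The third row is the kind of quiet inaccuracy that can sit in a pipeline for years.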
A great example of this is worker compensation insurance offered by states. Workers compensation insurance is a type of insurance that provides financial assistance to employees who are injured on the job. When an employee is injured on the job, a claim is filed with the state agency overseeing the insurance program. A part of the claim may include free form notes describing the injury or the state of the employee. Knowing the type of injury or body part injured in a structured format has long been a desire of state insurance organizations due to the strong ROI of this data. However, since the agency receives hundreds or thousands of claims each month, manually extracting the data isn’t cost effective. While natural language processing (NLP) has been around for over a decade, hiring a team of data scientists to try and build a NLP model often is not feasible both in time and in financial cost.
Consider the kind of free-form claim notes ingested as part of a nightly ETL process. Our goal is to label the body part injured, the type of injury, and whether the injury is serious.
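An illustrative note might read: "Employee was opening boxes with a utility knife when the blade slipped, cutting deeply across the left hand. The wound was cleaned and bandaged on site."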
The first step is to get access to an LLM. While data engineers are highly skilled, many are not data scientists or system administrators. They may not have the knowledge or time to deploy an LLM, including managing all of its dependencies, choosing the right GPU, and wiring the model into their ETL process. Fortunately, there is an easy button. With the Databricks Data Intelligence Platform, all the complexities of running LLMs are abstracted away from data engineers, allowing them to focus on delivering value faster. Leveraging Databricks' ai_query function, we can use an LLM to extract the required information.
CREATE OR REPLACE FUNCTION ANNOTATE_CLAIM(claim STRING)
RETURNS STRUCT<body_part: STRING, injury_type: STRING, serious: STRING>
RETURN FROM_JSON(
  ai_query(
    'databricks-meta-llama-3-1-70b-instruct',
    CONCAT(
      'We need to get the primary injury to the body and the type of injury.
      Extract the following information:
      - classify the body part involved as ["head","neck","hand","leg","eye","back","foot","full body","unknown"]
      - injury type as: ["broken","laceration","strain","burn","internal","bruise","unknown"]
      - serious: Y if the injury is likely to cause loss of limb or life, otherwise N
      Return JSON ONLY. No other text outside the JSON. JSON format:
      {
        "body_part": <body part>,
        "injury_type": <injury type>,
        "serious": <Y or N>
      }
      Claim note: ',
      claim)),
  'STRUCT<body_part: STRING, injury_type: STRING, serious: STRING>');
Let's break down what this function does. First, we define the function on line 1, which lets us call the LLM from SQL or Python. Line 2 declares the return type: a STRUCT mirroring the JSON object we ask the model to produce. Then we call ai_query on lines 4 and 5, using Llama 3.1 as the LLM. Llama is hosted by Databricks, so we don't need to worry about any setup for the LLM; in addition, we can swap in any of the Databricks foundation models, which allows for rapid testing of which LLM best suits our needs. The only part left is the prompt, spanning lines 6 through 19, which spells out the data we want to extract and the desired output format, and concatenates the claim note itself onto the end so the model has the text to annotate. Finally, FROM_JSON parses the model's JSON reply into the STRUCT using the schema on line 20. With that, we are ready to provide structure to our unstructured claim notes using nothing but SQL.
SELECT ANNOTATE_CLAIM(ClaimDescription) AS annotated_claim
FROM claims;  -- 'claims' is a placeholder for your source table
Below is a sample value from the new annotated_claim column.
{
  "body_part": "hand",
  "injury_type": "laceration",
  "serious": "N"
}
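To make this part of the nightly job, we can materialize the annotations and flatten the STRUCT into ordinary columns. A minimal sketch, again assuming a claims source table with a ClaimID key (both hypothetical names):

-- Annotate each claim once, then expand the STRUCT into plain columns.
CREATE OR REPLACE TABLE claims_annotated AS
SELECT
  ClaimID,                  -- hypothetical key column
  ClaimDescription,
  annotated.body_part,
  annotated.injury_type,
  annotated.serious
FROM (
  SELECT *, ANNOTATE_CLAIM(ClaimDescription) AS annotated
  FROM claims               -- hypothetical source table
);

Annotating in a subquery means the LLM is called once per row rather than once per extracted field.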
With the claim notes now in a structured format, we can build analytics and dashboards with Databricks AI/BI, or use Databricks Genie to ask natural language questions like "What is the most commonly injured body part?", unlocking insights that once were not possible.
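That Genie question maps to a simple aggregate over the (hypothetical) claims_annotated table from the sketch above:

SELECT body_part, COUNT(*) AS claim_count
FROM claims_annotated
GROUP BY body_part
ORDER BY claim_count DESC;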
The emergence of GenAI, in combination with Databricks, has dramatically lowered the cost and technical barriers to leveraging Large Language Models (LLMs) to extract structured data from unstructured data. Data engineers no longer need to be masters of regular expressions or NLP, or trained data scientists, to unlock the vast value sitting in their existing mountains of unstructured data.