Have you ever sunk into your sofa after a long day, craving some TV time, only to be faced with too many options? With hundreds of TV channels and several streaming services, it is easy to feel overwhelmed. Programmers, broadcasters, and platform developers work hard to solve this problem, providing Electronic Programme Guides (EPGs), guided user journeys and personalized recommendations to direct us to their content. However, classifying any kind of unstructured content is hard, and AI can help. This blog walks you through a solution built on Databricks that transforms TV programme data into easy-to-consume content, which can be used to build dashboards or simply to ask questions of your data. The solution uses publicly available metadata that, when joined with TV programme data, creates a more structured, searchable experience, making it much easier to find TV programme schedules, channels and genres.
To achieve our goals we will need to manage data quality, apply tags correctly, and maintain data cleanliness. Often, this process is manual, leading to inconsistencies and hindering the development of advanced use cases such as recommendations and improved search.
Content owners can access this data in a dashboard, through natural language or use APIs to enhance TV viewers' experiences.
We will use various tools within the Databricks Data Intelligence Platform. The process involves ingesting data into volumes, processing it through a medallion architecture pipeline to extract the necessary information, and finally enriching the data with the help of an LLM (Llama).
The next section goes over the details of the end-to-end data pipeline that keeps the TV data continuously updated.
This section describes a Databricks pipeline that can be adapted to your own data and use cases. It processes TV programme data from two sources: publicly available Electronic Programme Guide (EPG) data and British Audience Research Bureau (BARB) viewing figures.
This diagram shows the end-to-end architecture:
The data arrives in files. Databricks Auto Loader allows us to treat it as a stream.
Our pipeline is controlled by a multi-task Databricks Workflow. The workflow, which runs daily, consists of two regular “jobs” and a Delta Live Tables (DLT) pipeline, as follows:
The EPG data is fetched daily, but the BARB data has to be uploaded manually each week. The EPG data is in a format known as XMLTV. XMLTV files contain two types of record: 'channel' records (information about TV channels) and 'programme' records (when and where programmes are shown).
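For readers unfamiliar with the format, here is a small illustrative XMLTV fragment. The element names follow the XMLTV DTD; the channel id and start time mirror the composite key that appears later in this post, and the title and description are elided:

```xml
<tv>
  <channel id="BBCOneLondonHD.uk">
    <display-name>BBC One London HD</display-name>
  </channel>
  <programme start="20240829200000 +0100" channel="BBCOneLondonHD.uk">
    <title>...</title>
    <desc>...</desc>
  </programme>
</tv>
```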
Our ingestion task splits the data into two JSON files in Databricks volumes to be read as a stream by the DLT pipeline.
The BARB data is in CSV format. We manually add a date to the records and write them to a Databricks volume for reading by DLT.
The DLT pipeline picks up any new data from the EPG or BARB. DLT is a declarative framework that allows us to describe how we want the data to be transformed as it moves between the different layers. Databricks takes care of how to run and scale these transformations.
The DLT pipeline implements the medallion architecture, transforming the data through three layers:
The DLT pipeline reads each of the three file types from the volumes into three Bronze streaming tables using Auto Loader.
This is how they look in the Databricks Unity Catalog:
The tvdemo_bronze_epg table is defined in the DLT notebook.
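The original snippet is not reproduced here, but a minimal sketch of such a definition in DLT SQL might look like the following. The volume path is an assumption:

```sql
-- Sketch of a Bronze streaming table definition in DLT SQL.
-- The volume path is illustrative; cloud_files() tells DLT to stream
-- new files from the volume using Auto Loader.
CREATE OR REFRESH STREAMING TABLE tvdemo_bronze_epg
COMMENT "Raw EPG programme records ingested from the landing volume"
AS SELECT *
FROM cloud_files(
  "/Volumes/tvdemo/landing/epg_programmes",  -- JSON written by the ingestion task
  "json"
);
```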
Note that cloud_files indicates streaming from the volume using Auto Loader. We use similar DLT code to ingest the other two data sources into Bronze tables.
This is the DLT pipeline.
Our silver layer uses a star schema as shown.
As we move from Bronze to Silver we clean the data, making it more consistent, and match the EPG data with the BARB data to attach viewing figures to each programme.
Matching BARB and EPG data is not straightforward. BARB packs several pieces of programme information into a single text string, which has to be parsed before matching.
The network that the programme is shown on is in another column. The names of the networks don’t directly match those in the EPG, so a look-up table is maintained.
These are the silver tables in the Unity Catalog:
When a new channel appears in the EPG data this is diverted into an exception table and a Databricks alert is sent so that the relevant reference data can be added.
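As a hedged sketch, the exception routing could look like this in DLT SQL. The exception table name, look-up table name and join column are assumptions for illustration, not the project's actual names:

```sql
-- Sketch only: route EPG rows whose channel is missing from the
-- look-up table into an exception table for manual follow-up.
CREATE OR REFRESH STREAMING TABLE tvdemo_silver_channel_exceptions
COMMENT "EPG rows referencing channels not yet in the reference data"
AS SELECT e.*
FROM STREAM(LIVE.tvdemo_bronze_epg) e
LEFT ANTI JOIN LIVE.channel_lookup c
  ON e.channel_id = c.channel_id;
```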
This matching and cleansing is typical of how to populate a silver layer.
The Gold table is a single, denormalised table with all of the EPG and BARB data for each programme showing.
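A denormalising join over a star schema typically looks like the sketch below. All of the Silver table and column names here are invented for illustration; only the overall shape (one wide row per programme showing, with BARB figures optionally attached) follows the text:

```sql
-- Sketch: flatten the Silver star schema into one Gold row per showing.
-- Every table and column name below is illustrative.
CREATE OR REFRESH MATERIALIZED VIEW tvdemo_gold
AS SELECT
  s.showing_pk,
  c.channel_name,
  p.title,
  p.description,
  v.audience_size
FROM LIVE.silver_showing  s
JOIN LIVE.silver_channel   c ON s.channel_id   = c.channel_id
JOIN LIVE.silver_programme p ON s.programme_id = p.programme_id
LEFT JOIN LIVE.silver_viewing v
  ON s.showing_pk = v.showing_pk;  -- BARB figures may be absent
```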
AI enrichment is the last step of the pipeline. It uses an LLM (Llama 3.1, in this case) to enrich our TV programme data. We used the Databricks playground to compare different models and Llama 3.1 gave the best results. Should this change, the Databricks platform allows us to switch to a different model with a simple code change.
We call the LLM directly in SQL using the SQL AI function: ai_query.
The ai_query call takes the model serving endpoint name and the prompt as parameters.
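The original call is not shown here; a sketch, with the instruction text abbreviated and the source table name assumed, is:

```sql
-- Sketch of the enrichment call. The endpoint name is the one used in
-- this solution; the instruction text and source table are assumptions.
SELECT
  ai_query(
    'databricks-meta-llama-3-1-70b-instruct',       -- model serving endpoint
    '<instructions and JSON schema> ' || title || ' ' || description
  ) AS ai_json
FROM tvdemo_gold;
```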
It accesses the LLM via the Databricks model serving endpoint in the first parameter. ‘databricks-meta-llama-3-1-70b-instruct’. This is an instance of the Llama 3.1 open-sourced model hosted by Databricks but the endpoint could also point at a model fine-tuned for the purpose, a traditional classification model or a third party model from a service such as OpenAI - in this latter case Databricks is acting as a gateway providing controls over the calling of the third party model. This level of flexibility is a key benefit of using Databricks.
The prompt has to do the following:
“Use this schema
{"people":[{"name","role","background"}],"location":[{"name","fictional_flag","latitude","longitude"}],"genre","sub_genre","topic","movie_flag","production_year","production_company","childrens_flag"}”
This part of the prompt is key in ensuring that we have a consistent structure to perform further transformations on.
The programme’s title and description are appended to the prompt using SQL string concatenation:
||title||' '||description
A complete prompt combines these instructions with the schema and the programme’s title and description.
The LLM returns a JSON object that follows the requested schema.
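The actual response is not reproduced here. In an illustrative object, only the names, roles and genre below are taken from the results quoted later in this post; every "..." stands for a value not known here, and the boolean flags are likewise illustrative:

```json
{
  "people": [
    {"name": "John Torode", "role": "Judge", "background": "..."},
    {"name": "Gregg Wallace", "role": "Judge", "background": "..."}
  ],
  "location": [],
  "genre": "Reality television",
  "sub_genre": "...",
  "topic": "...",
  "movie_flag": false,
  "production_year": "...",
  "production_company": "...",
  "childrens_flag": false
}
```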
The data is in a JSON object with a known schema so we can use the SQL function from_json to extract new columns to write to a table called tvdemo_gold_ai.
e.g. upper(from_json(ai_json, 'struct<genre:string>').genre) as genre
where ai_json contains the JSON returned by the LLM. For the example TV programme above, the genre column contains the string “REALITY TELEVISION”.
The location and people fields are in arrays so we use the explode function to populate additional tables with one row per array element.
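A hedged sketch of that explode step follows. The ai_json column and tvdemo_gold_ai table names come from the text above; the struct schema is inferred from the schema requested in the prompt, and the key column name is assumed:

```sql
-- Sketch: one output row per element of the people array.
SELECT
  showing_pk,          -- composite key; column name assumed
  person.name AS name,
  person.role AS role
FROM tvdemo_gold_ai
LATERAL VIEW explode(
  from_json(
    ai_json,
    'struct<people:array<struct<name:string,role:string,background:string>>>'
  ).people
) people_exploded AS person;
```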
e.g. the tvdemo_gold_ai_people table contains these two rows for the above TV programme:
| showing_PK | Name | Role |
|---|---|---|
| BBCOneLondonHD.uk20240829200000 +0100 | John Torode | Judge |
| BBCOneLondonHD.uk20240829200000 +0100 | Gregg Wallace | Judge |
Note the composite primary key generated by the data engineering described in section 2.
We now have four gold tables which we can query.
There are a number of ways to access this data:
The screenshot shows the example TV programme in a Databricks AI/BI dashboard:
Databricks AI/BI Genie spaces allow us to query the tables using natural language. For example, here are some screenshots from an AI/BI Genie space built on our gold tables.
AI/BI Genie is able to deal with the misspelling of the programme title.
AI/BI Genie retains the context and we can refer to the TV programme as “it”.
We then prompt it to show the top 50 for that week and press the “Auto Visualize” button, resulting in the following chart:
Our solution demonstrates an end-to-end process for taking raw data, cleaning and joining it, and using AI to enrich it. The resulting Gold tables can then be queried using the Databricks AI/BI capabilities.
This process can easily be adapted to any content where you wish to enrich and enhance metadata and make it available for users to query and search or to guide users to the most relevant content for them.