Have you ever sunk into your sofa after a long day, craving some TV time, only to be faced with too many options? With hundreds of TV channels and several streaming services, it is easy to feel overwhelmed. Programmers, broadcasters, and platform developers work hard to solve this problem, providing Electronic Programme Guides (EPGs), guided user journeys and personalized recommendations to direct us to their content. However, classifying any kind of unstructured content is hard, and AI can help. This blog walks you through a solution built on Databricks that transforms TV programme data into easy-to-consume content, which can be used to build dashboards or simply to ask questions of your data. The solution uses publicly available metadata that, when joined with TV programme data, creates a more structured, searchable experience, making it much easier to find TV programme schedules, channels and genres.
To achieve our goals we will need to manage data quality, apply tags correctly, and maintain data cleanliness. Often, this process is manual, leading to inconsistencies and hindering the development of advanced use cases such as recommendations and improved search.
Content owners can access this data in a dashboard, through natural language or use APIs to enhance TV viewers' experiences.
We will use various tools within the Databricks Data Intelligence Platform. The process involves ingesting data into volumes, processing it through a medallion architecture pipeline to extract the necessary information, and finally enriching the data with the help of an LLM (Llama).
The next section goes over the details of the end-to-end data pipeline that keeps the TV data continuously updated.
This section describes a Databricks pipeline that can be adapted to your own data and use cases. It processes TV programme data from two sources: publicly available Electronic Programme Guide (EPG) data and British Audience Research Bureau (BARB) viewing figures.
This diagram shows the end-to-end architecture:
The data arrives in files. Databricks Auto Loader allows us to treat it as a stream.
Our pipeline is controlled by a multi-task Databricks Workflow. The workflow, which runs daily, consists of two regular “jobs” and a Delta Live Tables (DLT) pipeline, as follows:
The EPG data is fetched daily, but the BARB data has to be uploaded manually each week. The EPG data is in a format known as XMLTV. XMLTV files contain two types of record: 'channel' records (information about TV channels) and 'programme' records (when and where programmes are shown).
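For readers unfamiliar with the format, here is a small illustrative XMLTV fragment. The element names follow the XMLTV DTD; the channel id and start time mirror the composite key that appears later in this post, and the title and description are elided:

```xml
<tv>
  <channel id="BBCOneLondonHD.uk">
    <display-name>BBC One London HD</display-name>
  </channel>
  <programme start="20240829200000 +0100" channel="BBCOneLondonHD.uk">
    <title>...</title>
    <desc>...</desc>
  </programme>
</tv>
```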
Our ingestion task splits the data into two JSON files in Databricks volumes to be read as a stream by the DLT pipeline.
The BARB data is in CSV format. We manually add a date to the records and write them to a Databricks volume for reading by DLT.
The DLT pipeline picks up any new data from the EPG or BARB. DLT is a declarative framework that allows us to describe how we want the data to be transformed as it moves between the different layers. Databricks takes care of how to run and scale these transformations.
The DLT pipeline implements the medallion architecture, transforming the data through three layers:
The DLT pipeline reads each of the three file types from the volumes into three Bronze streaming tables using Auto Loader.
This is how they look in the Databricks Unity Catalog:
The tvdemo_bronze_epg table is defined in the DLT notebook.
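The original snippet is not reproduced here, but a minimal sketch of such a definition in DLT SQL might look like the following. The volume path is an assumption:

```sql
-- Sketch of a Bronze streaming table definition in DLT SQL.
-- The volume path is illustrative; cloud_files() tells DLT to stream
-- new files from the volume using Auto Loader.
CREATE OR REFRESH STREAMING TABLE tvdemo_bronze_epg
COMMENT "Raw EPG programme records ingested from the landing volume"
AS SELECT *
FROM cloud_files(
  "/Volumes/tvdemo/landing/epg_programmes",  -- JSON written by the ingestion task
  "json"
);
```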
Note that cloud_files indicates streaming from the volume using Auto Loader. We use similar DLT code to ingest the other two data sources into Bronze tables.
This is the DLT pipeline.
Our silver layer uses a star schema as shown.
As we move from Bronze to Silver we clean the data, making it more consistent, and match the EPG data with the BARB data to attach viewing figures to each programme.
Matching BARB and EPG data is not straightforward. BARB packs several pieces of programme information into a single text string, which has to be parsed before matching.
The network that the programme is shown on is in another column. The names of the networks don’t directly match those in the EPG, so a look-up table is maintained.
These are the silver tables in the Unity Catalog:
When a new channel appears in the EPG data this is diverted into an exception table and a Databricks alert is sent so that the relevant reference data can be added.
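As a hedged sketch, the exception routing could look like this in DLT SQL. The exception table name, look-up table name and join column are assumptions for illustration, not the project's actual names:

```sql
-- Sketch only: route EPG rows whose channel is missing from the
-- look-up table into an exception table for manual follow-up.
CREATE OR REFRESH STREAMING TABLE tvdemo_silver_channel_exceptions
COMMENT "EPG rows referencing channels not yet in the reference data"
AS SELECT e.*
FROM STREAM(LIVE.tvdemo_bronze_epg) e
LEFT ANTI JOIN LIVE.channel_lookup c
  ON e.channel_id = c.channel_id;
```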
This matching and cleansing is typical of how to populate a silver layer.
The Gold table is a single, denormalised table with all of the EPG and BARB data for each programme showing.
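A denormalising join over a star schema typically looks like the sketch below. All of the Silver table and column names here are invented for illustration; only the overall shape (one wide row per programme showing, with BARB figures optionally attached) follows the text:

```sql
-- Sketch: flatten the Silver star schema into one Gold row per showing.
-- Every table and column name below is illustrative.
CREATE OR REFRESH MATERIALIZED VIEW tvdemo_gold
AS SELECT
  s.showing_pk,
  c.channel_name,
  p.title,
  p.description,
  v.audience_size
FROM LIVE.silver_showing  s
JOIN LIVE.silver_channel   c ON s.channel_id   = c.channel_id
JOIN LIVE.silver_programme p ON s.programme_id = p.programme_id
LEFT JOIN LIVE.silver_viewing v
  ON s.showing_pk = v.showing_pk;  -- BARB figures may be absent
```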
AI enrichment is the last step of the pipeline. It uses an LLM (Llama 3.1, in this case) to enrich our TV programme data. We used the Databricks playground to compare different models and Llama 3.1 gave the best results. Should this change, the Databricks platform allows us to switch to a different model with a simple code change.
We call the LLM directly in SQL using the SQL AI function: ai_query.
The ai_query call takes the model serving endpoint name and the prompt as parameters.
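The original call is not shown here; a sketch, with the instruction text abbreviated and the source table name assumed, is:

```sql
-- Sketch of the enrichment call. The endpoint name is the one used in
-- this solution; the instruction text and source table are assumptions.
SELECT
  ai_query(
    'databricks-meta-llama-3-1-70b-instruct',       -- model serving endpoint
    '<instructions and JSON schema> ' || title || ' ' || description
  ) AS ai_json
FROM tvdemo_gold;
```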
It accesses the LLM via the Databricks model serving endpoint in the first parameter. ‘databricks-meta-llama-3-1-70b-instruct’. This is an instance of the Llama 3.1 open-sourced model hosted by Databricks but the endpoint could also point at a model fine-tuned for the purpose, a traditional classification model or a third party model from a service such as OpenAI - in this latter case Databricks is acting as a gateway providing controls over the calling of the third party model. This level of flexibility is a key benefit of using Databricks.
The prompt has to do the following:
“Use this schema
{"people":[{"name","role","background"}],"location":[{"name","fictional_flag","latitude","longitude"}],"genre","sub_genre","topic","movie_flag","production_year","production_company","childrens_flag"}”
This part of the prompt is key in ensuring that we have a consistent structure to perform further transformations on.
The programme’s title and description are appended to the prompt using SQL string concatenation:
||title||' '||description
A complete prompt combines these instructions with the schema and the programme’s title and description.
The LLM returns a JSON object that follows the requested schema.
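The actual response is not reproduced here. In an illustrative object, only the names, roles and genre below are taken from the results quoted later in this post; every "..." stands for a value not known here, and the boolean flags are likewise illustrative:

```json
{
  "people": [
    {"name": "John Torode", "role": "Judge", "background": "..."},
    {"name": "Gregg Wallace", "role": "Judge", "background": "..."}
  ],
  "location": [],
  "genre": "Reality television",
  "sub_genre": "...",
  "topic": "...",
  "movie_flag": false,
  "production_year": "...",
  "production_company": "...",
  "childrens_flag": false
}
```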
The data is in a JSON object with a known schema so we can use the SQL function from_json to extract new columns to write to a table called tvdemo_gold_ai.
e.g. upper(from_json(ai_json, 'struct<genre:string>').genre) as genre
where ai_json contains the JSON returned by the LLM. For the example TV programme above, the genre column contains the string “REALITY TELEVISION”.
The location and people fields are in arrays so we use the explode function to populate additional tables with one row per array element.
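A hedged sketch of that explode step follows. The ai_json column and tvdemo_gold_ai table names come from the text above; the struct schema is inferred from the schema requested in the prompt, and the key column name is assumed:

```sql
-- Sketch: one output row per element of the people array.
SELECT
  showing_pk,          -- composite key; column name assumed
  person.name AS name,
  person.role AS role
FROM tvdemo_gold_ai
LATERAL VIEW explode(
  from_json(
    ai_json,
    'struct<people:array<struct<name:string,role:string,background:string>>>'
  ).people
) people_exploded AS person;
```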
e.g. the tvdemo_gold_ai_people table contains these two rows for the above TV programme:
| showing_PK | Name | Role |
|---|---|---|
| BBCOneLondonHD.uk20240829200000 +0100 | John Torode | Judge |
| BBCOneLondonHD.uk20240829200000 +0100 | Gregg Wallace | Judge |
Note the composite primary key generated by the data engineering described in section 2.
We now have four gold tables which we can query.
There are a number of ways to access this data:
The screenshot shows the example TV programme in a Databricks AI/BI dashboard:
Databricks AI/BI Genie spaces allow us to query the tables using natural language. For example, here are some screenshots from an AI/BI Genie space built on our gold tables.
AI/BI Genie is able to deal with the misspelling of the programme title.
AI/BI Genie retains the context and we can refer to the TV programme as “it”.
We then prompt it to show the top 50 for that week and press the “Auto Visualize” button, resulting in the following chart:
Our solution demonstrates an end-to-end process for taking raw data, cleaning and joining it, and using AI to enrich it. The resulting Gold tables can then be queried using the Databricks AI/BI capabilities.
This process can easily be adapted to any content where you wish to enrich and enhance metadata and make it available for users to query and search or to guide users to the most relevant content for them.