The demo in this guide illustrates a lakehouse analytics pipeline using the well-known NYC taxi trip dataset. This public dataset is also available on Kaggle.
This guide provides a practical demonstration of how to structure a data pipeline using the medallion architecture, progressively refining and analyzing data as it moves through each layer. It also shows how the pipeline integrates with Unity Catalog for data lineage and governance. Finally, it covers AI/BI dashboards for enhancing analytical decision-making and sharing insights with team members.
This is a beginner’s tutorial with hands-on instructions to execute in your own Databricks workspace. You can request a free 14-day trial.
Understanding the medallion architecture
The medallion architecture is a method for organizing and refining data in a lakehouse by moving it through three layers—Bronze (raw data), Silver (cleaned data), and Gold (final, ready-to-use data).
Bronze Table: Raw data ingestion
Here, raw taxi trip data is ingested, with a basic data quality check applied to ensure trip distances are positive.
-- Bronze layer: Raw data ingestion
CREATE OR REPLACE TABLE taxi_raw_records AS
SELECT *
FROM samples.nyctaxi.trips
WHERE trip_distance > 0.0;
Silver layer
The Silver layer creates two tables:
Silver Table 1: Flagged rides
This table identifies potentially suspicious rides based on fare and distance criteria.
-- Silver Table 1: Flagged rides
CREATE OR REPLACE TABLE flagged_rides AS
SELECT
date_trunc("week", tpep_pickup_datetime) AS week,
pickup_zip AS zip,
fare_amount,
trip_distance
FROM
taxi_raw_records
WHERE ((pickup_zip = dropoff_zip AND fare_amount > 50) OR
(trip_distance < 5 AND fare_amount > 50));
Silver Table 2: Weekly statistics
This silver table calculates weekly average fares and trip distances.
-- Silver Table 2: Weekly statistics
CREATE OR REPLACE TABLE weekly_stats AS
SELECT
date_trunc("week", tpep_pickup_datetime) AS week,
AVG(fare_amount) AS avg_amount,
AVG(trip_distance) AS avg_distance
FROM
taxi_raw_records
GROUP BY week
ORDER BY week ASC;
Gold layer
Gold Table 1: Top N rides
Here, the two Silver tables are joined to provide a comprehensive view of the top three highest-fare flagged rides.
-- Gold layer: Top N rides to investigate
CREATE OR REPLACE TABLE top_n AS
SELECT
ws.week,
ROUND(ws.avg_amount, 2) AS avg_amount,
ROUND(ws.avg_distance, 3) AS avg_distance,
fr.fare_amount,
fr.trip_distance,
fr.zip
FROM
flagged_rides fr
LEFT JOIN weekly_stats ws ON ws.week = fr.week
ORDER BY fr.fare_amount DESC
LIMIT 3;
To keep the Bronze, Silver, and Gold tables current with fresh data, schedule the notebook as a job that runs periodically.
Explore and manage your generated datasets through Unity Catalog using Catalog Explorer. Unity Catalog organizes data in a three-level namespace: Catalog.Schema.Table. From Catalog Explorer you can inspect sample data, view table details, and explore the end-to-end data lineage of your pipeline objects.
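You can also inspect the same tables with SQL from a notebook or the SQL editor. The sketch below assumes the pipeline tables live in the main catalog and default schema; substitute whatever catalog and schema your workspace uses.
-- Inspect pipeline tables by their three-level name (catalog.schema.table).
-- "main" and "default" are placeholder names; use your own catalog and schema.
SHOW TABLES IN main.default;

-- View column types, table properties, and storage details for the Gold table.
DESCRIBE TABLE EXTENDED main.default.top_n;

-- Preview a few rows of the Gold table.
SELECT * FROM main.default.top_n LIMIT 10;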
While you develop a dashboard, your progress is saved as a draft. To create a clean copy for easy consumption, publish your dashboard. After you publish a dashboard, the published version remains intact until you publish again, even if you make changes to the draft. You can make modifications and improvements to the draft version without affecting the published copy.
Add users to your Databricks Workspace
Add as many users to your account as you would like. New users receive an email prompting them to set up their account.
Share the dashboard with colleagues
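Sharing the dashboard from the UI controls who can view it; depending on how it is published, colleagues may also need permission to query the underlying Unity Catalog tables. The sketch below grants read access on the Gold table to a group; the main catalog, default schema, and analysts group are placeholder assumptions.
-- Grant the access needed to query the Gold table behind the dashboard.
-- "main", "default", and the `analysts` group are placeholder names.
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.default TO `analysts`;
GRANT SELECT ON TABLE main.default.top_n TO `analysts`;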
You’ve now created, run, and analyzed an analytics pipeline demonstrating data processing, data quality checks, and the creation of analytics-ready tables and dashboards. Not only have you gained insights into the taxi trip data, but you’ve also likely learned how expensive a ride in the Big Apple can be if things go wrong — perhaps it’s time to consider the subway!
As you become more comfortable with ETL pipelines, you can expand this pipeline with more complex transformations, data streaming, and more comprehensive data quality checks.
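As one example of a stricter data quality check, Delta tables support CHECK constraints that reject writes containing invalid rows. The sketch below adds a constraint to the Bronze table; the constraint name and rule are illustrative.
-- Reject future writes with non-positive fares.
-- Existing rows must already satisfy the constraint, or the statement fails.
ALTER TABLE taxi_raw_records
  ADD CONSTRAINT positive_fare CHECK (fare_amount > 0);
Once the constraint is in place, any write that violates it fails, surfacing bad data at ingestion time rather than in downstream tables.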
"Excellent exercise! The most challenging part was setting up the environment using the Azure free account. However, now that I have access to all the functionalities, I am well-equipped to continue training and building my skills with Databricks. Looking forward to diving deeper into the platform!"
Really well explained with Demo