- Article History
- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content
05-10-2024 12:04 PM - edited 08-15-2024 11:09 AM
Getting started with Databricks - Build a simple Lakehouse analytics pipeline
The demo example in this guide illustrates a lakehouse analytics pipeline using the well-known NYC taxi trip dataset. This public dataset is also available at Kaggle.
This guide provides a practical demonstration of how to structure a data pipeline using the medallion architecture, progressively refining and analyzing data as it moves through each layer. It also demonstrates the pipeline’s integration with Unity Catalog, showing how to set up data lineage and governance. Finally, it covers AI/BI dashboards for enhancing analytical decision-making and sharing insights with team members.
This is a beginner’s tutorial with hands-on instructions to execute in your own Databricks workspace. You can request a free 14-day trial.
Understanding the medallion architecture
The medallion architecture is a method for organizing and refining data in a lakehouse by moving it through three layers—Bronze (raw data), Silver (cleaned data), and Gold (final, ready-to-use data).
Step 1: Create a notebook and add SQL pipeline code
- In your Databricks workspace, click "+New" in the left sidebar and select Notebook. Name the notebook “NYTaxi Pipeline SQL.”
- Set the notebook’s default language to SQL next to its name. We want to code the pipeline in SQL for simplicity.
- Copy and paste the following code into your new SQL notebook.
Bronze layer
Bronze Table: Raw data ingestion
Here, raw taxi trip data is ingested, with a basic data quality check applied to ensure trip distances are positive.
- Bronze layer: Raw data ingestion
CREATE OR REPLACE TABLE taxi_raw_records AS
SELECT *
FROM samples.nyctaxi.trips
WHERE trip_distance > 0.0;
-Silver layer
The Silver layer creates two tables:
Silver Table 1: Flagged rides
This table identifies potentially suspicious rides based on fare and distance criteria.
-- Silver Table 1: Flagged rides
CREATE OR REPLACE TABLE flagged_rides AS
SELECT
date_trunc("week", tpep_pickup_datetime) AS week,
pickup_zip AS zip,
fare_amount,
trip_distance
FROM
taxi_raw_records
WHERE ((pickup_zip = dropoff_zip AND fare_amount > 50) OR
(trip_distance < 5 AND fare_amount > 50));
Silver Table 2: Weekly statistics
This silver table calculates weekly average fares and trip distances.
- Silver layer 2: Weekly statistics
CREATE OR REPLACE TABLE weekly_stats AS
SELECT
date_trunc("week", tpep_pickup_datetime) AS week,
AVG(fare_amount) AS avg_amount,
AVG(trip_distance) AS avg_distance
FROM
taxi_raw_records
GROUP BY week
ORDER BY week ASC;
Gold layer
Gold Table 1: Top N rides
Here, these silver tables are integrated to provide a comprehensive view of the top three highest-fare rides.
-- Gold layer: Top N rides to investigate
CREATE OR REPLACE TABLE top_n AS
SELECT
ws.week,
ROUND(ws.avg_amount, 2) AS avg_amount,
ROUND(ws.avg_distance, 3) AS avg_distance,
fr.fare_amount,
fr.trip_distance,
fr.zip
FROM
flagged_rides fr
LEFT JOIN weekly_stats ws ON ws.week = fr.week
ORDER BY fr.fare_amount DESC
LIMIT 3;
Step 2: Schedule a notebook job
To ensure that the Bronze, Silver, and Gold tables are regularly updated with fresh data, it’s recommended to schedule the notebook as a job to run periodically. This will keep the tables current with the latest information. To schedule a notebook job to run periodically:
- In the notebook, click the Schedule button at the top right.
- In the Schedule dialog, optionally enter a name for the job. The default name is the name of the notebook.
- Scheduled runs allow you to define a schedule for your job run. Adjust the frequency, time, and time zone for the job run.
- Leave everything else as is and click Create.
- Once the schedule is successfully created, click Run Now to trigger a job run for the NYCTaxiSQL Pipeline.
Step 3: Discover data using Catalog Explorer
Explore and manage your generated datasets through Unity Catalog using Catalog Explorer. Unity Catalog organizes data in a three-level namespace: Catalog.Schema.Table. Follow the steps below to examine the data produced by your pipeline and visualize its lineage. This allows you to inspect sample data, view table details, and explore the end-to-end data lineage of your pipeline objects.
- Click Catalog in the sidebar to open the Catalog Explorer.
- The newly created tables from our NYCTaxiSQLPipeline should be accessible by clicking the Recents top on the top of the page.
- Click weekly_stats to load the table details.
- Click Lineage and then Lineage Graph to view a graphical representation of all the upstream and downstream tables from weekly_stats.
Step 4: Create a Dashboard
- Use the Create button on the right side to create a Dashboard from the weekly_stats table. This automatically creates a new dashboard for the weekly_stats table. Rename the dashboard on the top as NYCTaxiPipelineDash.
- Start building out your dashboard by using the Ask the Assistant prompt at the top of the dashboard to create your first chart using AI. Click one of the auto-generated prompts to get started.
- Add more charts by using the “Add a Visualization” widget from the selector at the bottom and place the cursor on a space on the dashboard. Here are some additional prompts:
- Show a scatter plot of the trip distance and average fare amount by day of the week.
- Visualize total trips across zip codes as a bar chart
- Click Accept to save the AI-generated chart or use the widget on the right to regenerate the chart. Use the
button on the upper-right of each chart to provide a different prompt to update the chart.
- Alternatively, use the chart builder on the right side to choose a chart type and then select values for the x-axis and y-axis.
Step 5: Publish and distribute your dashboard
While you develop a dashboard, your progress is saved as a draft. To create a clean copy for easy consumption, publish your dashboard. After you publish a dashboard, the published version remains intact until you publish again, even if you make changes to the draft. You can make modifications and improvements to the draft version without affecting the published copy.
- When you’re done adding charts to this dashboard, click Publish in the upper-right corner to create a clean copy of the current dashboard.
- Click Publish. Your dashboard is now ready to be shared with other team members.
Add users to your Databricks Workspace
- In the top bar of the Databricks workspace, click your username and then click Settings.
- In the sidebar, click Identity and Access.
- Next to Users, click Manage.
- Click Add user, and then click Add new.
- Enter the user’s email address, and then click Add.
Continue to add as many users to your account as you would like. New users receive an email prompting them to set up their account.
Share the dashboard with colleagues
- To manage access to the dashboard, click Share at the top of the dashboard to open the permissions dialog.
- Enter the email address of the user you want to share the dashboard with and click "Add." They will receive an email notification and will be able to access the dashboard in Databricks.
- [Optional] To share the dashboard with all colleagues, add the “All Users” group to the dashboard’s access list with “Can View” or “Can Run” permissions. You can also copy the dashboard’s URL to your clipboard using the “Copy link” button and send it directly to your colleagues. 🎉
Congratulations!
You’ve now created, run, and analyzed an analytics pipeline demonstrating data processing, data quality checks, and the creation of analytics-ready tables and dashboards. Not only have you gained insights into the taxi trip data, but you’ve also likely learned how expensive a ride in the Big Apple can be if things go wrong — perhaps it’s time to consider the subway!
As you become more comfortable with ETL pipelines, you can expand this pipeline with more complex transformations, data streaming, and more comprehensive data quality checks.
- To learn more about Databricks Jobs, see What is Databricks Jobs?
- To learn more about Delta Lake, see What is Delta Lake?
- To learn more about data processing pipelines with Delta Live Tables, see What is Delta Live Tables?