Getting Started with Databricks - Zero to AIBI in 30 minutes
This is a five-step guide that will help you set up your first Analytics and BI use case on Databricks.
Once you finish this guide, you’ll have:
- Built an ETL pipeline that loads data into your Databricks Delta Lake
- Created a Lakeview dashboard and shared it with a teammate
Introduction to Databricks
The Databricks Data Intelligence Platform is a comprehensive solution that integrates the Lakehouse architecture with generative AI capabilities. It provides a unified platform for data engineering, data science, machine learning, and analytics tasks. This allows organizations to store, manage, analyze, and derive insights from their data seamlessly and efficiently.
- Easily access multiple data sources with simplified infrastructure management at any scale.
- Create fast and highly reliable data pipelines, which are critical for the analytics and data science teams who rely on complete and accurate data for decision-making, analytics, and model training.
- With data at your fingertips, quickly perform the analytics and transformations that feed business reports and dashboards.
To learn more about the Databricks Data Intelligence platform, visit our website.
Analytics & BI on Databricks
Analytics and BI on Databricks starts with Lakeview dashboards. Built with analysts and end consumers at the center, Lakeview dashboards are the best way to share data insights with your team, fully unified with the Lakehouse.
- Data model: queries are not separate objects in a Lakeview dashboard; all content is bundled within the dashboard itself. When you share a dashboard, you share everything it needs, with no separate sharing step for queries.
- Consistent user permissions across underlying datasets and visualizations. Changes to the draft dashboard do not affect the published version, and draft/publish capabilities let you share published dashboards with anyone in the organization, including people outside the Databricks workspace.
- Unity Catalog (UC) lineage is integrated into the published dashboard consumer experience, so your business users always know where insights come from.
To learn more about Lakeview dashboards, visit our website.
Step 1: Create a notebook and add SQL pipeline code
- In your Databricks workspace, click "+New" in the left sidebar and select Notebook. Name the notebook “NYCTaxi Pipeline SQL.”
- Set the notebook’s default language to SQL using the selector next to its name. We’ll write the pipeline in SQL for simplicity.
- Copy and paste the following code into your new SQL notebook.
Bronze Layer
In the Bronze layer, we ingest the raw trip records, filtering out obviously invalid rows (trips with no distance). This is also where you could add logging and error handling to manage issues during data ingestion.
-- Bronze layer: Raw data ingestion
CREATE OR REPLACE TABLE taxi_raw_records AS
SELECT *
FROM samples.nyctaxi.trips
WHERE trip_distance > 0.0;
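As a quick sanity check (not part of the pipeline itself), you can compare the Bronze table against the source to see how many rows the distance filter removed:
-- Optional check: count the source rows dropped by the trip_distance filter
SELECT
  (SELECT COUNT(*) FROM samples.nyctaxi.trips) -
  (SELECT COUNT(*) FROM taxi_raw_records) AS dropped_rows;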
Silver Layer
The Silver layer is where we introduce more complex transformations and business logic. Here, we flag rides for review: trips where the pickup and drop-off ZIP codes match with a fare over $50, or trips under five miles with a fare over $50.
-- Silver Table 1: Flagged Rides
CREATE OR REPLACE TABLE flagged_rides AS
SELECT
date_trunc("week", tpep_pickup_datetime) AS week,
pickup_zip AS zip,
fare_amount,
trip_distance
FROM
taxi_raw_records
WHERE ((pickup_zip = dropoff_zip AND fare_amount > 50) OR
(trip_distance < 5 AND fare_amount > 50));
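To confirm the table looks as expected, a quick count of flagged rides per week works well (a verification sketch only, not part of the pipeline):
-- Optional check: number of flagged rides per week
SELECT week, COUNT(*) AS num_flagged
FROM flagged_rides
GROUP BY week
ORDER BY week;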
Silver Table 2: Weekly statistics
Materialized views (MVs) are like regular views but are precomputed and automatically kept up to date as their sources change. End-user queries and dashboards are fast because they query precomputed results instead of scanning potentially massive data volumes in the source tables. Create a materialized view in the Silver layer to meet these analytics needs.
-- Silver Table 2: Weekly statistics
CREATE OR REPLACE MATERIALIZED VIEW weekly_stats AS
SELECT
date_trunc("week", tpep_pickup_datetime) AS week,
AVG(fare_amount) AS avg_amount,
AVG(trip_distance) AS avg_distance
FROM
taxi_raw_records
GROUP BY week
ORDER BY week ASC;
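Materialized views are typically refreshed on a schedule or by the pipeline that defines them, but you can also trigger a refresh manually; a minimal sketch:
-- Manually refresh the materialized view so it reflects the latest source data
REFRESH MATERIALIZED VIEW weekly_stats;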
Gold layer
Finally, in the Gold layer, we combine these datasets into a view ready for business decision-making: the top flagged rides to investigate, enriched with the weekly averages for context.
-- Gold layer: Top N rides to investigate
CREATE OR REPLACE TABLE top_n AS
SELECT
ws.week,
ROUND(ws.avg_amount, 2) AS avg_amount,
ROUND(ws.avg_distance, 3) AS avg_distance,
fr.fare_amount,
fr.trip_distance,
fr.zip
FROM
flagged_rides fr
LEFT JOIN weekly_stats ws ON ws.week = fr.week
ORDER BY fr.fare_amount DESC
LIMIT 3;
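To spot-check the result, you can preview the Gold table directly (verification only):
-- Preview the Gold table
SELECT * FROM top_n;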
Step 2: Schedule a notebook job
To schedule a notebook job to run periodically:
- In the notebook, click the Schedule button at the top right.
- In the Schedule dialog, optionally enter a name for the job. The default name is the name of the notebook.
- Scheduled runs let you define when the job runs. Adjust the frequency, time, and time zone for the job run.
- Leave everything else as is and click Create.
- Once the schedule is successfully created, click Run Now to trigger a job run of the NYCTaxi Pipeline SQL notebook.
Step 3: Discover data using Catalog Explorer
Explore and manage your generated datasets through Unity Catalog using Catalog Explorer. Browse assets in your metastore, including catalogs, schemas, tables, models, and functions. Unity Catalog organizes data in a three-level namespace: Catalog.Schema.Table.
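Because the pipeline notebook did not qualify its table names, the tables land in your workspace’s default catalog and schema (often main.default, though your workspace may differ). You can reference them anywhere with the full three-level name, as in this sketch:
-- Query a pipeline table by its fully qualified name
-- (replace main.default with your workspace's default catalog and schema)
SELECT week, avg_amount, avg_distance
FROM main.default.weekly_stats
ORDER BY week;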
- Click Catalog in the sidebar to open the Catalog Explorer.
- The newly created tables from the NYCTaxi Pipeline SQL notebook should be accessible by clicking the Recents tab at the top of the page.
- Click weekly_stats to load the table details.
- Click Lineage and then Lineage Graph to view a graphical representation of all the upstream and downstream tables from weekly_stats.
Step 4: Create a Lakeview Dashboard
- With the weekly_stats table open in Catalog Explorer, use the Create button on the right to create a dashboard from the table. Rename the dashboard at the top to NYCTaxiPipelineDash.
- Start building out your dashboard by using the Ask the Assistant prompt at the top of the dashboard to create your first chart with AI. Click one of the auto-generated prompts to get started, or try one of these (a sketch of the kind of query this can generate appears after these steps):
- Show a scatter plot of the trip distance and average fare amount by day of the week.
- Visualize total trips across zip codes
- Click Accept to save the AI-generated chart or use the widget on the right to regenerate the chart. Use the button in the upper-right of each chart to provide a different prompt to update the chart.
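For reference, here is a sketch of the kind of dataset query the Assistant might generate for the first prompt above; the query it actually produces will differ, and the column names assume the taxi_raw_records table from Step 1.
-- Illustrative only: average fare and distance by day of week
SELECT
  date_format(tpep_pickup_datetime, 'EEEE') AS day_of_week,
  AVG(trip_distance) AS avg_distance,
  AVG(fare_amount) AS avg_fare
FROM taxi_raw_records
GROUP BY date_format(tpep_pickup_datetime, 'EEEE');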
Step 5: Publish and distribute your Lakeview dashboard
- When you’re done adding charts to this dashboard, click Publish in the upper-right corner to create a clean copy of the current draft.
- In the dialog that appears, click Publish to confirm. Your dashboard is now ready to be shared with other team members.
Add users to your Databricks Workspace
- In the top bar of the Databricks workspace, click your username and then click Settings.
- In the sidebar, click Identity and access.
- Next to Users, click Manage.
- Click Add user, and then click Add new.
- Enter the user’s email address, and then click Add.
Continue to add as many users to your account as you would like. New users receive an email prompting them to set up their account.
Share the dashboard with colleagues
- To manage access to the dashboard, click Share at the top of the dashboard to open the permissions dialog.
- Share your dashboard with any colleague by adding the “All Users” group to the dashboard’s access list with “Can View” or “Can Run” permission, then send your colleague the dashboard’s URL, which can be copied to your clipboard using the “Copy link” button! 🎉
Next steps