DatabricksGuide
Community Manager

 

Getting Started with Databricks - From Ingest to Analytics & BI

This is an eight-step guide that will help you set up your first Analytics and BI use case on Databricks starting from ingesting data.

Once you finish implementing this guide you’ll have:

  • Ingested data from your cloud storage into Databricks
  • Performed an ETL into your Databricks Delta Lake
  • Built a BI query on your own data
  • Created a Lakeview dashboard and shared it with a teammate

 

Introduction to Databricks

The Databricks Data Intelligence Platform is a comprehensive solution that integrates the Lakehouse architecture with generative AI capabilities. It provides a unified platform for data engineering, data science, machine learning, and analytics tasks. This allows organizations to store, manage, analyze, and derive insights from their data in a seamless and efficient manner.

  • Easily access multiple data sources with simplified infrastructure management at any scale.
  • Create fast and highly reliable data pipelines – critical for analytics and the data science teams who rely on complete and accurate data for decision-making, analytics and model training.
  • With data at your fingertips, quickly perform analytics and transformations that feed business reports and dashboards.

To learn more about the Databricks Data Intelligence platform, visit our website.

 

Analytics & BI on Databricks

Analytics and BI on Databricks starts with Lakeview dashboards. Built with analysts and end consumers at the center, Lakeview is the best way to share data insights with your team, fully unified with the Lakehouse.

  • Data model: queries are not distinct objects in a Lakeview dashboard. All content is bundled within the dashboard itself, so when you share a dashboard, only the dashboard needs to be shared; there is no separate sharing action for queries.
  • Consistent user permissions across underlying datasets and visualizations. Changes to the draft dashboard do not impact the published version, and draft/publish capabilities let you share published dashboards with anyone in the organization, including people outside the Databricks workspace.
  • Unity Catalog lineage is integrated into the published dashboard consumer experience, so your business users always know where insights are derived from.

To learn more about Lakeview dashboards, visit our website.

 

Ingest Setup Steps [20 minutes]

 

Step 0: Check your required prerequisites

This section of the guide assumes you have the following:

  1. You have a Databricks workspace up and running
  2. Your account has Unity Catalog enabled (Unity Catalog is enabled by default for newer accounts)
  3. You have admin permissions for your Databricks account
  4. You have access to your AWS account and data residing in an S3 storage bucket

 

Step 1: Access and start your warehouse

To get started with Databricks, you need to start your starter warehouse. This compute resource lets you query and explore data on Databricks, specifically using SQL commands. To start it:

  1. In your Databricks workspace, click Compute on the sidebar.
  2. At the top of the page, click the SQL Warehouses subtab.
  3. Locate the Starter Warehouse in your SQL warehouse list view.
  4. Click the start icon next to the warehouse.
Note: only users with workspace administrator permissions are able to start the starter warehouse. If you don’t have admin permissions, please contact your administrator to get access to the starter warehouse or to create a new serverless warehouse.
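
Once the warehouse is running, you can open the SQL editor and run a quick smoke-test query, for example against the NYC Taxi sample data (this is just a sketch and assumes the samples catalog is available in your workspace):

sql
-- Quick smoke test: confirm the warehouse can run queries
SELECT * FROM samples.nyctaxi.trips LIMIT 10;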

 

Step 2: Connect your workspace to data sources

To connect your Databricks workspace to your cloud storage, you need to create an external location. An external location is an object that combines a cloud storage path with the credential that authorizes access to the storage path.

[Animated walkthrough: adding an external location using the AWS Quickstart]

Watch and follow along in your workspace as you go!

  1. In your Databricks workspace, click Catalog on the sidebar.
  2. At the top of the page, click + Add.
  3. Click Add an external location.
  4. Databricks recommends using the AWS Quickstart, which ensures that your workspace is given the correct permissions on the bucket.
  5. In Bucket Name, enter the name of the bucket you want to import data from.
  6. Click Generate New Token and copy the token.
  7. Click Launch in Quickstart.
  8. In your AWS console, enter the copied token in the Databricks Personal Access Token field.
  9. Select the I acknowledge that AWS CloudFormation might create IAM resources with custom names checkbox.
  10. Click Create stack.

    To see the external locations in your workspace, click Catalog in the sidebar, then at the bottom of the left navigation pane click External Data, and then click External Locations. Your new external location will have a name using the following syntax: db_s3_external_databricks-S3-ingest-<id>.
  11. Test your connection

To verify external locations have functioning connections, do the following:

  1. Click the external location you want to test.
  2. Click Test connection.
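
If you prefer SQL to the UI, an external location can also be created with a SQL command. The sketch below is only illustrative: it assumes an administrator has already created a storage credential (named my_s3_credential here), and the location name and bucket path are placeholders.

sql
-- Create an external location from an existing storage credential (names and path are placeholders)
CREATE EXTERNAL LOCATION IF NOT EXISTS my_s3_ingest
  URL 's3://my-ingest-bucket/raw'
  WITH (STORAGE CREDENTIAL my_s3_credential)
  COMMENT 'Raw landing zone for the getting-started guide';

-- List external locations to confirm it was created
SHOW EXTERNAL LOCATIONS;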

For help, contact onboarding-help@databricks.com or your Databricks account team directly.

 

Step 3: Add your data to Databricks

Now that your workspace has a connection to your S3 bucket, you can add your data.

Part of this step is choosing where to put your data. Databricks has a three-level namespace that organizes your data (catalog.schema.table). For this exercise, you import the data into the default catalog named after your workspace.
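
Once a table exists, a fully qualified reference in SQL looks like this minimal sketch (the catalog, schema, and table names are placeholders for your own):

sql
-- Fully qualified reference: catalog.schema.table
SELECT * FROM my_workspace.default.sales_raw LIMIT 5;

-- Or set the default catalog and schema for your session
USE CATALOG my_workspace;
USE SCHEMA default;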

  1. In the sidebar of your Databricks workspace, click New > Add data.
  2. Click Amazon S3.
  3. Select your external location from the drop-down menu.
  4. Select all the files you want to add to your Databricks catalog.
  5. Click Preview table.
  6. Select the default catalog (named after your workspace), the default schema, and then enter a name for your table.
  7. Click Create Table.

You can now use Catalog Explorer in your workspace to see your data in Databricks.
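
If you would rather load the files with SQL instead of the add-data UI, here is a hedged sketch using the read_files table-valued function (the path, format, and table name are placeholders, and availability depends on your warehouse and runtime version):

sql
-- Create a table directly from files in the external location (placeholder path and names)
CREATE TABLE IF NOT EXISTS my_workspace.default.sales_raw AS
SELECT *
FROM read_files(
  's3://my-ingest-bucket/raw/sales/',
  format => 'csv',
  header => true
);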

 

Step 4: Process your data

The medallion lakehouse architecture is a way of organizing and managing data in Databricks. It divides the data into three layers: bronze, silver, and gold:

  • Bronze Layer: This is where we store the raw data as it comes in. It's like keeping the original, untouched version of the data. We can keep adding new data to this layer over time.
  • Silver Layer: After the data comes into the bronze layer, we clean it up and make sure it's accurate. This layer is like a polished version of the data. We remove any duplicate information and make sure everything is correct.
  • Gold Layer: This is where we take the cleaned-up data from the silver layer and make it useful. We do things like combining different pieces of data to find trends or patterns. The data in this layer is what we use for important things like analytics and making decisions.

Let’s create an ETL to walk through the Medallion architecture:

  1. Open a new notebook in your Databricks workspace using the +New button in the top left.
  2. Follow the learn-as-you-go experience below to create an ETL on either your ingested data or the NYC Taxi Dataset, which is available in the samples catalog in Catalog Explorer. The code snippets below are examples based on a Product Sales dataset.
  3. Let's add transformations, more detailed aggregations, and additional checks for data integrity. These steps ensure more refined data handling and can provide deeper insights into the data.
By the end, you will have generated one Bronze table, two Silver tables, and one Gold table to demonstrate the value of the multi-hop architecture, preparing the data for analysis and visualization.
  1. Bronze Layer

In the Bronze layer, we can add logging and error handling to manage issues during data ingestion.

sql
-- Creating the Bronze table with error handling columns
CREATE TABLE bronze_sales_data (
    id STRING,
    transaction_date TIMESTAMP,
    amount DECIMAL(10,2),
    product_id STRING,
    quantity INT,
    customer_id STRING,
    error_description STRING
);
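
To land raw files in this Bronze table, one option is COPY INTO, which ingests only files it has not loaded before. The path and file-format options below are placeholder assumptions based on the external location from Step 2:

sql
-- Incrementally ingest raw files into the Bronze table (placeholder path and options)
COPY INTO bronze_sales_data
FROM 's3://my-ingest-bucket/raw/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');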
  2. Silver Layer

We can introduce more complex transformations in the Silver layer, such as handling different time zones, computing running totals, and enhancing customer profiles based on transaction history.

Silver Table 1: Enhanced Transactions

sql
-- Create a Silver table for enhanced transaction data
CREATE TABLE silver_transactions AS
SELECT
    id,
    CAST(from_utc_timestamp(transaction_date, 'America/Los_Angeles') AS DATE) AS local_transaction_date,
    amount,
    product_id,
    quantity,
    customer_id,
    (amount * 1.1) AS total_amount, -- total with an assumed flat 10% uplift (for example, tax)
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS customer_lifetime_value
FROM bronze_sales_data
WHERE error_description IS NULL;

Silver Table 2: Customer Demographics

Materialized views (MVs) are like regular views but are pre-computed and automatically kept up-to-date as the sources change. End-user queries and dashboards are fast because they are querying data that is precomputed instead of directly querying potentially massive data volumes in the source tables.

Create a materialized view in the silver layer to meet analytics needs. The example below assumes a bronze_customer_data table was ingested alongside the sales data.

sql
-- Create a materialized view to enhance customer demographic information
CREATE MATERIALIZED VIEW silver_customer_demographics AS
SELECT
    c.customer_id,
    c.name,
    c.age,
    c.gender,
    SUM(t.amount) AS total_spent,
    AVG(t.amount) AS average_spent,
    MAX(t.transaction_date) AS last_purchase_date
FROM bronze_customer_data c
JOIN silver_transactions t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.name, c.age, c.gender;
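
Materialized views are kept up to date on the refresh schedule you choose; you can also trigger a refresh manually when you want the latest source data reflected immediately:

sql
-- Manually refresh the materialized view
REFRESH MATERIALIZED VIEW silver_customer_demographics;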

 

  3. Gold Layer

Finally, in the Gold layer, we integrate these datasets to provide a comprehensive view suitable for business decision-making, like customer segmentation based on purchasing patterns and demographics.

sql
-- Create a Gold table for strategic business insights
CREATE TABLE gold_customer_segmentation AS
SELECT
    d.customer_id,
    d.name,
    d.age,
    d.gender,
    d.total_spent,
    d.average_spent,
    d.last_purchase_date,
    CASE 
        WHEN d.total_spent > 1000 THEN 'High-value'
        WHEN d.total_spent >= 500 THEN 'Medium-value'
        ELSE 'Low-value'
    END AS customer_value_segment,
    CASE 
        WHEN datediff(current_date(), d.last_purchase_date) <= 30 THEN 'Active'
        ELSE 'Inactive'
    END AS customer_activity_status
FROM silver_customer_demographics d;
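
An aggregation over this Gold table is the kind of query you will later visualize in the dashboard; the sketch below simply summarizes customers and spend by segment:

sql
-- Example BI-style query over the Gold table
SELECT
    customer_value_segment,
    customer_activity_status,
    COUNT(*) AS customers,
    SUM(total_spent) AS total_revenue
FROM gold_customer_segmentation
GROUP BY customer_value_segment, customer_activity_status
ORDER BY total_revenue DESC;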

For help, contact onboarding-help@databricks.com or your Databricks account team directly.

 

Step 5: Data Discovery using Catalog Explorer

Now that you have your generated datasets, let's explore how they are managed through Unity Catalog. Catalog Explorer allows you to explore and manage assets hosted in your metastore. Each catalog can hold data, schemas (databases), tables, models, functions, and other assets, and Catalog Explorer automatically generates AI-assisted insights, summaries, and search on top of them.

The Unity Catalog hierarchy starts from the metastore with a three-level namespace that organizes your data in the following order:

  • Catalog: first layer that organizes your data assets
  • Schema: second layer that contains tables or views
  • Table: lowest level of the hierarchy where your datasets are located.

Click a table in Catalog Explorer to see more information about its columns, preview sample data, and view details, permissions, lineage, insights, and quality.

  1. In the sidebar of your Databricks workspace, click Catalog.
  2. Click the catalog that holds your tables (the default catalog named after your workspace).
  3. Select your schema.
  4. Select one of your datasets from Step 4.
  5. Let's walk through the key sub-tabs:
  • History: the historical log of the dataset, including timestamps for everything that has happened to it (see the example query after this list).
  • Lineage: see the upstream and downstream tables.
  • Insights: see the frequent queries, users, and more that touch this table, which helps you gauge how effectively the data is being used.
  • Quality: set up quality checks.
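
Some of this information is also available directly in SQL. For example, the History tab reflects the Delta transaction history of the table, which you can query yourself (the table name below is just one of the Step 4 tables):

sql
-- View the Delta transaction history that backs the History tab
DESCRIBE HISTORY gold_customer_segmentation;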

 

Step 6: Create your Lakeview dashboard

This step will allow you to create and share insights using your output datasets from Step 4. For this step, you will create a Lakeview dashboard by selecting the tables from the default catalog named after your workspace.

  1. In the sidebar of your Databricks workspace, click New > Lakeview Dashboard.
  2. Click the Data tab. This is where you define the underlying datasets used in your dashboard.
  3. Click “Select a table”
  4. Search for the tables you created in Step 4. Select the catalog and then the schema to find and bring your datasets into Lakeview.

     

  5. If you would like to create additional aggregated tables on top of your existing datasets, click "Create from SQL". This allows you to define datasets for your dashboard directly using SQL queries (see the example after this list).
  6. Click Canvas on the top navigation bar.
  7. On the bottom middle of your dashboard canvas, you will see three widget types to start building your dashboard. To start, click "Add a visualization" to create a visualization widget and use your mouse to place it on the canvas.
  8. Configure your visualization and continue to build your dashboard with text box and filter widgets.
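
As a sketch of the "Create from SQL" option mentioned in step 5, a dashboard dataset could be defined with a query like the one below; it assumes the Silver and Gold tables from Step 4:

sql
-- Example dashboard dataset: monthly revenue by customer segment (assumes the Step 4 tables)
SELECT
    date_trunc('month', t.local_transaction_date) AS month,
    g.customer_value_segment,
    SUM(t.amount) AS revenue
FROM silver_transactions t
JOIN gold_customer_segmentation g
  ON t.customer_id = g.customer_id
GROUP BY date_trunc('month', t.local_transaction_date), g.customer_value_segment
ORDER BY month;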

For inspirational visualizations, follow the quick start guide for a detailed walkthrough of creating a sample dashboard using the NYC Taxi Dataset.

 

Note: this dataset is not managed in Unity Catalog, so you must define the datasets by writing SQL queries with the "Create from SQL" option rather than by selecting tables. All users can write SQL queries to define a dataset.

 

Step 7: Publish and distribute your Lakeview dashboard


1. Publish - while you develop a dashboard, your progress is saved as a draft. To create a clean copy for easy consumption, publish your dashboard. A published dashboard accesses live data and runs queries with either embedded credentials or the viewing user's credentials. There are two types of credentials you can set:

  • Embed credentials (default and recommended): if you are sharing with users in your organization who do not have Databricks workspace access, select embed credentials so that all viewers of the published dashboard use your data and warehouse permissions to run queries.
  • Don't embed credentials: each viewer of the published dashboard must use their own data and warehouse permissions to run queries.

A pop-up will appear where you can view the published dashboard.


2. Schedule

  • On your published dashboard, click the Schedule button.
  • Create your schedule with the desired configuration settings.

3. Share and Subscribe

  • Return to your draft dashboard and click the Share button.
  • Add the users or groups you want to share with, and set permission levels as appropriate. To learn more about published Lakeview dashboard permissions, visit our documentation here.
  • If you created a schedule, users with access to the published dashboard will have the option to subscribe. You can also directly subscribe Databricks users to each defined schedule.

 
