DatabricksGuide
Community Manager

 

Getting Started with Databricks - From Ingest to Analytics & BI

This is an eight-step guide that will help you set up your first Analytics and BI use case on Databricks starting from ingesting data.

Once you finish implementing this guide you’ll have:

  • Ingested data from your cloud storage into Databricks
  • Performed an ETL into your Databricks Delta Lake
  • Built a BI query on your own data
  • Created a Lakeview dashboard and shared it with a teammate

 

Introduction to Databricks

The Databricks Data Intelligence Platform is a comprehensive solution that integrates the Lakehouse architecture with generative AI capabilities. It provides a unified platform for data engineering, data science, machine learning, and analytics tasks. This allows organizations to store, manage, analyze, and derive insights from their data in a seamless and efficient manner.

  • Easily access multiple data sources with simplified infrastructure management at any scale.
  • Create fast and highly reliable data pipelines – critical for analytics and the data science teams who rely on complete and accurate data for decision-making, analytics and model training.
  • With data at your fingertips, quickly perform analytics and transformations that feed business reports and dashboards.

To learn more about the Databricks Data Intelligence platform, visit our website.

 

Analytics & BI on Databricks

Analytics and BI on Databricks starts with Lakeview dashboards. Built with analysts and end consumers at the center, Lakeview is the best way to share data insights with your team, fully unified with the Lakehouse.

  • Data model: queries are not distinct objects in a Lakeview dashboard. All content is bundled within the dashboard itself, so when you share a dashboard, only the dashboard needs to be shared; there is no separate sharing action for queries.
  • Consistent user permissions across underlying datasets and visualizations. Changes to the draft dashboard do not impact the published version, and draft/publish capabilities let you share published dashboards with anyone in the organization, including people outside the Databricks workspace.
  • Unity Catalog lineage is integrated into the published dashboard consumer experience, so your business users always know where insights are derived from.

To learn more about Lakeview dashboards, visit our website.

 

Ingest Setup Steps [20 minutes]

 

Step 0: Check your required prerequisites

This section of the guide assumes you have the following:

  1. You have a Databricks workspace up and running
  2. Your account has Unity Catalog enabled (Unity Catalog is enabled by default for newer accounts)
  3. You have admin permissions for your Databricks account
  4. You have access to your AWS account and data residing in an S3 storage bucket

 

Step 1: Access and start your warehouse

To get started with Databricks, you need to start your starter warehouse. This compute resource lets you query and explore data on Databricks, specifically using SQL commands. To start it:

  1. In your Databricks workspace, click Compute on the sidebar.
  2. At the top of the page, click the SQL Warehouses subtab.
  3. Locate the Starter Warehouse in your SQL warehouse list view.
  4. Click the start icon next to the warehouse.
Note: only users with workspace administrator permissions are able to start the starter warehouse. If you don’t have admin permissions, please contact your administrator to get access to the starter warehouse or to create a new serverless warehouse.
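
Once the warehouse is running, you can open the SQL editor and run a quick smoke-test query, for example against the NYC Taxi sample data (this is just a sketch and assumes the samples catalog is available in your workspace):

sql
-- Quick smoke test: confirm the warehouse can run queries
SELECT * FROM samples.nyctaxi.trips LIMIT 10;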

 

Step 2: Connect your workspace to data sources

To connect your Databricks workspace to your cloud storage, you need to create an external location. An external location is an object that combines a cloud storage path with the credential that authorizes access to the storage path.

[Animated walkthrough: adding an external location using the AWS Quickstart]

Watch and follow along in your workspace as you go!

  1. In your Databricks workspace, click Catalog on the sidebar.
  2. At the top of the page, click + Add.
  3. Click Add an external location.
  4. Databricks recommends using the AWS Quickstart, which ensures that your workspace is given the correct permissions on the bucket.
  5. In Bucket Name, enter the name of the bucket you want to import data from.
  6. Click Generate New Token and copy the token.
  7. Click Launch in Quickstart.
  8. In your AWS console, enter the copied token in the Databricks Personal Access Token field.
  9. Select the I acknowledge that AWS CloudFormation might create IAM resources with custom names checkbox.
  10. Click Create stack.

    To see the external locations in your workspace, click Catalog in the sidebar, then at the bottom of the left navigation pane click External Data, and then click External Locations. Your new external location will have a name using the following syntax: db_s3_external_databricks-S3-ingest-<id>.
  11. Test your connection

To verify external locations have functioning connections, do the following:

  1. Click the external location you want to test.
  2. Click Test connection.
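
If you prefer SQL to the UI, an external location can also be created with a SQL command. The sketch below is only illustrative: it assumes an administrator has already created a storage credential (named my_s3_credential here), and the location name and bucket path are placeholders.

sql
-- Create an external location from an existing storage credential (names and path are placeholders)
CREATE EXTERNAL LOCATION IF NOT EXISTS my_s3_ingest
  URL 's3://my-ingest-bucket/raw'
  WITH (STORAGE CREDENTIAL my_s3_credential)
  COMMENT 'Raw landing zone for the getting-started guide';

-- List external locations to confirm it was created
SHOW EXTERNAL LOCATIONS;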

For help, contact onboarding-help@databricks.com or your Databricks account team directly.

 

Step 3: Add your data to Databricks

Now that your workspace has a connection to your S3 bucket, you can add your data.

Part of this step is choosing where to put your data. Databricks has a three-level namespace that organizes your data (catalog.schema.table). For this exercise, you import the data into the default catalog named after your workspace.
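
Once a table exists, a fully qualified reference in SQL looks like this minimal sketch (the catalog, schema, and table names are placeholders for your own):

sql
-- Fully qualified reference: catalog.schema.table
SELECT * FROM my_workspace.default.sales_raw LIMIT 5;

-- Or set the default catalog and schema for your session
USE CATALOG my_workspace;
USE SCHEMA default;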

  1. In the sidebar of your Databricks workspace, click New > Add data.
  2. Click Amazon S3.
  3. Select your external location from the drop-down menu.
  4. Select all the files you want to add to your Databricks catalog.
  5. Click Preview table.
  6. Select the default catalog (named after your workspace), the default schema, and then enter a name for your table.
  7. Click Create Table.

You can now use Catalog Explorer in your workspace to see your data in Databricks.
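
If you would rather load the files with SQL instead of the add-data UI, here is a hedged sketch using the read_files table-valued function (the path, format, and table name are placeholders, and availability depends on your warehouse and runtime version):

sql
-- Create a table directly from files in the external location (placeholder path and names)
CREATE TABLE IF NOT EXISTS my_workspace.default.sales_raw AS
SELECT *
FROM read_files(
  's3://my-ingest-bucket/raw/sales/',
  format => 'csv',
  header => true
);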

 

Step 4: Process your data

The medallion lakehouse architecture is a way of organizing and managing data in Databricks. It divides the data into three layers: bronze, silver, and gold:

  • Bronze Layer: This is where we store the raw data as it comes in. It's like keeping the original, untouched version of the data. We can keep adding new data to this layer over time.
  • Silver Layer: After the data comes into the bronze layer, we clean it up and make sure it's accurate. This layer is like a polished version of the data. We remove any duplicate information and make sure everything is correct.
  • Gold Layer: This is where we take the cleaned-up data from the silver layer and make it useful. We do things like combining different pieces of data to find trends or patterns. The data in this layer is what we use for important things like analytics and making decisions.

Let’s create an ETL to walk through the Medallion architecture:

  1. Open a new notebook in your Databricks workspace using the +New button in the top left.
  2. Follow the learn-as-you-go experience below to create an ETL on either your ingested data or the NYC Taxi Dataset, which is available in the samples catalog in Catalog Explorer. The code snippets below are examples based on a Product Sales dataset.
  3. Let's add transformations, more detailed aggregations, and additional checks for data integrity. These steps ensure more refined data handling and can provide deeper insights into the data.
By the end, you will have generated one Bronze table, two Silver tables, and one Gold table to demonstrate the value of the multi-hop architecture, preparing the data for analysis and visualization.
  1. Bronze Layer

In the Bronze layer, we can add logging and error handling to manage issues during data ingestion.

sql
-- Creating the Bronze table with error handling columns
CREATE TABLE bronze_sales_data (
    id STRING,
    transaction_date TIMESTAMP,
    amount DECIMAL(10,2),
    product_id STRING,
    quantity INT,
    customer_id STRING,
    error_description STRING
);
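
To land raw files in this Bronze table, one option is COPY INTO, which ingests only files it has not loaded before. The path and file-format options below are placeholder assumptions based on the external location from Step 2:

sql
-- Incrementally ingest raw files into the Bronze table (placeholder path and options)
COPY INTO bronze_sales_data
FROM 's3://my-ingest-bucket/raw/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');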
  2. Silver Layer

We can introduce more complex transformations in the Silver layer, such as handling different time zones, computing running totals, and enhancing customer profiles based on transaction history.

Silver Table 1: Enhanced Transactions

sql
-- Create a Silver table for enhanced transaction data
CREATE TABLE silver_transactions AS
SELECT
    id,
    CAST(from_utc_timestamp(transaction_date, 'America/Los_Angeles') AS DATE) AS local_transaction_date,
    amount,
    product_id,
    quantity,
    customer_id,
    (amount * 1.1) AS total_amount, -- total with an assumed flat 10% uplift (for example, tax)
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS customer_lifetime_value
FROM bronze_sales_data
WHERE error_description IS NULL;

Silver Table 2: Customer Demographics

Materialized views (MVs) are like regular views but are pre-computed and automatically kept up-to-date as the sources change. End-user queries and dashboards are fast because they are querying data that is precomputed instead of directly querying potentially massive data volumes in the source tables.

Create a materialized view in the silver layer to meet analytics needs. The example below assumes a bronze_customer_data table was ingested alongside the sales data.

sql
-- Create a materialized view to enhance customer demographic information
CREATE MATERIALIZED VIEW silver_customer_demographics AS
SELECT
    c.customer_id,
    c.name,
    c.age,
    c.gender,
    SUM(t.amount) AS total_spent,
    AVG(t.amount) AS average_spent,
    MAX(t.transaction_date) AS last_purchase_date
FROM bronze_customer_data c
JOIN silver_transactions t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.name, c.age, c.gender;
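
Materialized views are kept up to date on the refresh schedule you choose; you can also trigger a refresh manually when you want the latest source data reflected immediately:

sql
-- Manually refresh the materialized view
REFRESH MATERIALIZED VIEW silver_customer_demographics;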

 

  3. Gold Layer

Finally, in the Gold layer, we integrate these datasets to provide a comprehensive view suitable for business decision-making, like customer segmentation based on purchasing patterns and demographics.

sql
-- Create a Gold table for strategic business insights
CREATE TABLE gold_customer_segmentation AS
SELECT
    d.customer_id,
    d.name,
    d.age,
    d.gender,
    d.total_spent,
    d.average_spent,
    d.last_purchase_date,
    CASE 
        WHEN d.total_spent > 1000 THEN 'High-value'
        WHEN d.total_spent >= 500 THEN 'Medium-value'
        ELSE 'Low-value'
    END AS customer_value_segment,
    CASE 
        WHEN datediff(current_date(), d.last_purchase_date) <= 30 THEN 'Active'
        ELSE 'Inactive'
    END AS customer_activity_status
FROM silver_customer_demographics d;
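
An aggregation over this Gold table is the kind of query you will later visualize in the dashboard; the sketch below simply summarizes customers and spend by segment:

sql
-- Example BI-style query over the Gold table
SELECT
    customer_value_segment,
    customer_activity_status,
    COUNT(*) AS customers,
    SUM(total_spent) AS total_revenue
FROM gold_customer_segmentation
GROUP BY customer_value_segment, customer_activity_status
ORDER BY total_revenue DESC;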

For help, contact onboarding-help@databricks.com or your Databricks account team directly.

 

Step 5: Data Discovery using Catalog Explorer

Now that you have your generated datasets, let's explore how they are managed through Unity Catalog. Catalog Explorer allows you to explore and manage assets hosted in your metastore. Each catalog can hold data, schemas (databases), tables, models, functions, and other assets, and Catalog Explorer automatically generates AI-assisted insights, summaries, and search on top of them.

The Unity Catalog hierarchy starts from the metastore with a three-level namespace that organizes your data in the following order:

  • Catalog: first layer that organizes your data assets
  • Schema: second layer that contains tables or views
  • Table: lowest level of the hierarchy where your datasets are located.

Click a table in Catalog Explorer to see more information about its columns, preview sample data, and view details, permissions, lineage, insights, and quality.

  1. In the sidebar of your Databricks workspace, click Catalog.
  2. Click the catalog that holds your tables (the default catalog named after your workspace).
  3. Select your schema.
  4. Select one of your datasets from Step 4.
  5. Let's walk through the key sub-tabs:
  • History: the historical log of the dataset, including timestamps for everything that has happened to it (see the example query after this list).
  • Lineage: see the upstream and downstream tables.
  • Insights: see the frequent queries, users, and more that touch this table, which helps you gauge how effectively the data is being used.
  • Quality: set up quality checks.
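
Some of this information is also available directly in SQL. For example, the History tab reflects the Delta transaction history of the table, which you can query yourself (the table name below is just one of the Step 4 tables):

sql
-- View the Delta transaction history that backs the History tab
DESCRIBE HISTORY gold_customer_segmentation;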

 

Step 6: Create your Lakeview dashboard

This step will allow you to create and share insights using your output datasets from Step 4. For this step, you will create a Lakeview dashboard by selecting the tables from the default catalog named after your workspace.

  1. In the sidebar of your Databricks workspace, click New > Lakeview Dashboard.
  2. Click the Data tab. This is where you define the underlying datasets used in your dashboard.
  3. Click “Select a table”
  4. Search for the tables you created in Step 4. Select the catalog and then the schema to find and bring your datasets into Lakeview.

     

  5. If you would like to create additional aggregated tables on top of your existing datasets, click "Create from SQL". This allows you to define datasets for your dashboard directly using SQL queries (see the example after this list).
  6. Click Canvas on the top navigation bar.
  7. On the bottom middle of your dashboard canvas, you will see three widget types to start building your dashboard. To start, click "Add a visualization" to create a visualization widget and use your mouse to place it on the canvas.
  8. Configure your visualization and continue to build your dashboard with text box and filter widgets.
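
As a sketch of the "Create from SQL" option mentioned in step 5, a dashboard dataset could be defined with a query like the one below; it assumes the Silver and Gold tables from Step 4:

sql
-- Example dashboard dataset: monthly revenue by customer segment (assumes the Step 4 tables)
SELECT
    date_trunc('month', t.local_transaction_date) AS month,
    g.customer_value_segment,
    SUM(t.amount) AS revenue
FROM silver_transactions t
JOIN gold_customer_segmentation g
  ON t.customer_id = g.customer_id
GROUP BY date_trunc('month', t.local_transaction_date), g.customer_value_segment
ORDER BY month;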

For inspirational visualizations, follow the quick start guide for a detailed walkthrough of creating a sample dashboard using the NYC Taxi Dataset.

 

Note: this dataset is not managed in Unity Catalog, so you must define the datasets by writing SQL queries with the "Create from SQL" option rather than by selecting tables. All users can write SQL queries to define a dataset.

 

Step 7: Publish and distribute your Lakeview dashboard


1. Publish - while you develop a dashboard, your progress is saved as a draft. To create a clean copy for easy consumption, publish your dashboard. A published dashboard accesses live data and runs queries with either embedded credentials or the viewing user's credentials. There are two types of credentials you can set:

  • Embed credentials (default and recommended): if you are sharing with users in your organization who do not have Databricks workspace access, select embed credentials so that all viewers of the published dashboard use your data and warehouse permissions to run queries.
  • Don't embed credentials: each viewer of the published dashboard must use their own data and warehouse permissions to run queries.

A pop-up will appear where you can view the published dashboard.


2. Schedule

  • On your published dashboard, click the Schedule button.
  • Create your schedule with the desired configuration settings.

3. Share and Subscribe

  • Return to your draft dashboard and click the Share button.
  • Add the users or groups you want to share with, and set permission levels as appropriate. To learn more about published Lakeview dashboard permissions, visit our documentation here.
  • If you created a schedule, users with access to the published dashboard will have the option to subscribe. You can also directly subscribe Databricks users to each defined schedule.

 
