Get Started Guides
DatabricksGuide

Getting started with Databricks - Exploratory analysis

This guide walks you through using a Databricks notebook to query sample data stored in Unity Catalog using Python and then visualize the query results in the notebook.

This is a beginner’s tutorial with hands-on instructions to follow in your own Databricks workspace. If you don’t have a workspace yet, you can request a free 14-day trial.

Step 1: Access sample data

  1. Click Catalog in the sidebar to open the Catalog Explorer. This is the primary UI for exploring and managing data, including schemas, tables, models, and other data objects.
  2. Click the samples catalog. This Databricks-managed catalog contains two sample datasets for your exploration: nyctaxi and tpch.
  3. Click the nyctaxi schema, and then open the trips table.
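
Unity Catalog addresses every table with a three-level namespace, catalog.schema.table, which is why the steps above navigate catalog, then schema, then table. A minimal sketch of composing that name (using the sample names from this guide):

```python
# Unity Catalog uses a three-level namespace: <catalog>.<schema>.<table>
catalog, schema, table = "samples", "nyctaxi", "trips"
full_name = f"{catalog}.{schema}.{table}"
print(full_name)  # samples.nyctaxi.trips
```

In a Databricks notebook you would pass this full name to spark.read.table(full_name) or use it directly in a SQL query.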

Step 2: Create a notebook

  1. Click the Create button in the top-right corner to create a notebook pre-populated with a query of the trips table. You can also create a blank query or notebook using the + New button at the top left of the screen.
  2. If you created a blank notebook or query, you can query the trips table with the SQL statement SELECT * FROM <catalog-name>.<schema-name>.<table-name> (for this guide, SELECT * FROM samples.nyctaxi.trips). If you run your analysis in a notebook, be sure to set the notebook language to SQL using the language selector in the top pane of the notebook.
  3. Here are some sample commands to try out in Python:

 

import pandas as pd

# Read the sample NYC Taxi Trips dataset into a PySpark DataFrame,
# then convert it to a pandas DataFrame
df = spark.read.table('samples.nyctaxi.trips')
pdf = df.toPandas()

# Select the 10 most expensive trips by fare_amount
# and the 10 longest trips by trip_distance
most_expensive_trips = pdf.nlargest(10, 'fare_amount')
longest_trips = pdf.nlargest(10, 'trip_distance')
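
If you want to experiment with this nlargest pattern outside a Databricks workspace (where spark is unavailable), the same logic runs on a small synthetic stand-in for the trips data. The column names below mirror the real table, but the values are made up:

```python
import pandas as pd

# Synthetic stand-in for samples.nyctaxi.trips (made-up values)
pdf = pd.DataFrame({
    "fare_amount":   [12.5, 45.0, 8.0, 60.25, 22.0],
    "trip_distance": [2.1, 10.4, 0.9, 15.8, 4.3],
})

# Same selection logic as the notebook example, on 2 rows instead of 10
most_expensive_trips = pdf.nlargest(2, "fare_amount")
longest_trips = pdf.nlargest(2, "trip_distance")

print(most_expensive_trips)
```

nlargest returns rows sorted in descending order of the given column, so the first row of most_expensive_trips is the single priciest trip.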

 

Step 3: Use the Databricks Assistant for code suggestions

  1. Notebooks come equipped with the context-aware Databricks Assistant, which can help generate, explain, and fix code using natural language.
  2. To use the Assistant, create a new cell and press Cmd+I, or click the Assistant icon in the top-right corner of the new cell.
  3. Enter a prompt for the Assistant to provide code suggestions. Here are some sample prompts:
    1. What is the most common pickup zip across all trips?
    2. What is the minimum, maximum, and average fare amount across all trips?

  4. Press Return or click the submit button to submit the prompt, and the Assistant will suggest code to answer it. Click “Accept” to keep the suggestion, then run the cell to view the results.
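
For the second sample prompt, the Assistant would typically suggest a simple aggregation. A hedged sketch of what that code might look like, run here on a small synthetic frame rather than the real trips table:

```python
import pandas as pd

# Synthetic fares standing in for the trips table's fare_amount column
pdf = pd.DataFrame({"fare_amount": [10.0, 25.0, 7.5, 40.0, 17.5]})

# Minimum, maximum, and average fare amount across all trips
fare_stats = pdf["fare_amount"].agg(["min", "max", "mean"])
print(fare_stats)
```

The same agg call works unchanged on the pandas DataFrame produced by df.toPandas() in Step 2.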

Step 4: Visualize the data

  1. You can visualize the data from your query results by clicking the + button at the top of the results pane and completing the visualization builder dialog.
  2. Select your preferred visualization type, and fill out the chart values to prepare the chart.
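
The visualization builder works directly on query results, but it often helps to aggregate first. For example, a bar chart of trips per pickup zip starts from a count like this (synthetic zip codes below; the real trips table exposes a pickup_zip column, as the Assistant prompts in Step 3 suggest):

```python
import pandas as pd

# Synthetic pickup zips standing in for the trips table's pickup_zip column
pdf = pd.DataFrame({"pickup_zip": [10001, 10002, 10001, 10003, 10001, 10002]})

# Trips per pickup zip, sorted most-frequent first
trips_per_zip = pdf["pickup_zip"].value_counts()
print(trips_per_zip)
```

The index of the first row (trips_per_zip.idxmax()) also answers the first Assistant prompt: the most common pickup zip.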

Step 5: Share your results with non-Databricks users

Add users to your Databricks Workspace
  1. In the top bar of the Databricks workspace, click your username and then click Settings.
  2. In the sidebar, click Identity and Access.
  3. Next to Users, click Manage.
  4. Click Add user, and then click Add new.
  5. Enter the user’s email address, and then click Add.

Continue to add as many users to your account as you would like. New users receive an email prompting them to set up their account.

Share the notebook with colleagues
  1. To manage access to the notebook, click Share at the top of the notebook to open the permissions dialog. 
  2. To share your notebook with colleagues, add the “All Users” group to the notebook’s access list with “Can View” or “Can Run” permission, then send them the notebook’s URL, which you can copy to your clipboard using the “Copy link” button.

Next steps

  1. To learn about adding data from CSV files to Unity Catalog and visualizing data, see Get started: Import and visualize CSV data from a notebook.
  2. To learn how to load data into Databricks using Apache Spark, see Tutorial: Load and transform data using Apache Spark DataFrames.
  3. To learn more about ingesting data into Databricks, see Ingest data into a Databricks lakehouse.
  4. To learn more about querying data with Databricks, see Query data.
  5. To learn more about visualizations, see Visualizations in Databricks notebooks.
Version history
Last update: 08-26-2024 09:53 AM