Note: the following guide is primarily for Python users. For other languages, please view the following links:
• Table batch reads and writes
• Create a table in SQL
• Visualizing data with DBSQL
This step-by-step guide will get your data science projects underway by enabling you to:
• Use display() commands to quickly understand your data
• Process and save data efficiently
• Import any machine learning framework
To start, use the persona switcher to open your Machine Learning homepage
Part 1: Use display() commands to quickly understand your data
Use the display() command to view your DataFrame in an interactive output and to quickly create visualizations.
1. Create a notebook. Give it a name, set the default language to Python, and select a cluster
2. Write a command to load your data into a DataFrame, or load the following sample DataFrame:
raw_data = spark.read.format("delta").load("/databricks-datasets/nyctaxi-with-zipcodes/subsampled")
3. Use the Python display() command to view your DataFrame:
display(raw_data)
4. Above the displayed results, to the right of the Table tab, click + and select "Visualization"
5. In the Visualization type drop-down, choose a chart type
Recommendation: Use a scatter plot for this data
6. Select the data to appear in the visualization
Recommendation: X column = trip_distance; Y column = fare_amount
7. Click Save
You are now ready to discover new insights from your data.
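If you prefer to explore the same columns programmatically before building a chart, here is a minimal sketch (it assumes the raw_data DataFrame from step 2) that pairs display() with Spark's built-in summary statistics:
# Summary statistics (count, mean, stddev, min, quartiles, max)
# for the two columns plotted above; assumes raw_data from step 2
display(raw_data.select("trip_distance", "fare_amount").summary())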
Part 2: Process and save data efficiently
Save the results of your analysis by persisting them to storage:
• SQL DDL commands: You can use standard SQL DDL commands supported in Apache Spark (for example, CREATE TABLE AS SELECT) to create Delta tables; see the sketch after the snippet below
• Table batch writes guide:
# Create table in the metastore using DataFrame's schema and write data to it
df.write.format("delta").saveAsTable("default.people10m")
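For the SQL DDL route mentioned above, here is a minimal sketch run from Python. The table name default.nyctaxi_sample is hypothetical; the source path is the sample dataset from Part 1:
# CREATE TABLE AS SELECT over the sample Delta path from Part 1
# (default.nyctaxi_sample is an illustrative table name)
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.nyctaxi_sample
    AS SELECT * FROM delta.`/databricks-datasets/nyctaxi-with-zipcodes/subsampled`
""")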
Part 3: Import any machine learning framework
1. Import the necessary libraries. These libraries are preinstalled on Databricks Runtime for Machine Learning (AWS|Azure|GCP) clusters and are tuned for compatibility and performance.
import mlflow
import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.ensemble
from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
from hyperopt.pyll import scope
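To check that everything is wired up, here is a minimal training sketch using the imports above. The dataset, model, and parameters are illustrative choices for this sketch, not part of the official guide:
# Illustrative dataset and model -- swap in your own data
X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=42
)

# Automatically log parameters, metrics, and the fitted model to MLflow
mlflow.autolog()

with mlflow.start_run():
    model = sklearn.ensemble.RandomForestRegressor(n_estimators=100)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mse = sklearn.metrics.mean_squared_error(y_test, predictions)
    print(f"Test MSE: {mse:.2f}")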
Now that you've trained your first machine learning model, check out the links below to learn more.
Learn more:
• Databricks introduction to notebooks
• Documentation on how to import, read and modify data
• Guide to creating visualizations
• Data Science getting started guide
• Apache Spark Programming with Databricks course
• Ask a Databricks expert live in Office Hours
• Feel free to contact us
Drop your questions, feedback and tips below!