Note: the following guide is primarily for Python users. For other languages, please view the following links:
• Table batch reads and writes
• Create a table in SQL
• Visualizing data with DBSQL
This step-by-step guide will get your data science projects underway by enabling you to:
• Use display() commands to quickly understand your data
• Process and save data efficiently
• Import any machine learning framework
To start, use the persona switcher to open your Machine Learning homepage
Part 1: Use display() commands to quickly understand your data
Use the display() command to view your DataFrame in an interactive output and to quickly create visualizations.
1. Create a notebook. Give it a name, set the default language to Python, and select a cluster
2. Write a command to load your data into a DataFrame, or load the following sample DataFrame:
raw_data = spark.read.format("delta").load("/databricks-datasets/nyctaxi-with-zipcodes/subsampled")
3. Use the Python display() command to view your DataFrame:
display(raw_data)
4. Above the displayed results, to the right of the Table tab, click + and select "Visualization"
5. In the Visualization type drop-down, choose a chart type
Recommendation: Use a scatter plot for this data
6. Select the data to appear in the visualization
Recommendation: X column = trip_distance; Y column = fare_amount
7. Click Save
You are now ready to discover new insights from your data.
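If you prefer to explore the same columns programmatically before building a chart, here is a minimal sketch (it assumes the raw_data DataFrame from step 2) that pairs display() with Spark's built-in summary statistics:
# Summary statistics (count, mean, stddev, min, quartiles, max)
# for the two columns plotted above; assumes raw_data from step 2
display(raw_data.select("trip_distance", "fare_amount").summary())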
Part 2: Process and save data efficiently
Save the results of your analysis by persisting them to storage:
• SQL DDL commands: You can use standard SQL DDL commands supported in Apache Spark (for example, CREATE TABLE AS SELECT) to create Delta tables; see the sketch after the snippet below
• Table batch writes guide:
# Create table in the metastore using DataFrame's schema and write data to it
df.write.format("delta").saveAsTable("default.people10m")
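For the SQL DDL route mentioned above, here is a minimal sketch run from Python. The table name default.nyctaxi_sample is hypothetical; the source path is the sample dataset from Part 1:
# CREATE TABLE AS SELECT over the sample Delta path from Part 1
# (default.nyctaxi_sample is an illustrative table name)
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.nyctaxi_sample
    AS SELECT * FROM delta.`/databricks-datasets/nyctaxi-with-zipcodes/subsampled`
""")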
Part 3: Import any machine learning framework
1. Import the necessary libraries. These libraries are preinstalled on Databricks Runtime for Machine Learning (AWS|Azure|GCP) clusters and are tuned for compatibility and performance.
import mlflow
import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.ensemble
from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
from hyperopt.pyll import scope
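To check that everything is wired up, here is a minimal training sketch using the imports above. The dataset, model, and parameters are illustrative choices for this sketch, not part of the official guide:
# Illustrative dataset and model -- swap in your own data
X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=42
)

# Automatically log parameters, metrics, and the fitted model to MLflow
mlflow.autolog()

with mlflow.start_run():
    model = sklearn.ensemble.RandomForestRegressor(n_estimators=100)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mse = sklearn.metrics.mean_squared_error(y_test, predictions)
    print(f"Test MSE: {mse:.2f}")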
Now that you've trained your first machine learning model, check out the links below to learn more.
Learn more:
• Databricks introduction to notebooks
• Documentation on how to import, read and modify data
• Guide to creating visualizations
• Data Science getting started guide
• Apache Spark Programming with Databricks course
• Ask a Databricks expert live in Office Hours
• Feel free to contact us
Drop your questions, feedback and tips below!