Jenni
Databricks Employee


 

Features (Community Edition vs. Free Edition):

  • Notebooks
  • MLflow
  • Ingestion
  • Jobs
  • Pipelines
  • Dashboards
  • Genie
  • Semantic Search
  • Model Serving
  • Model Evaluation
  • Agents
  • Unity Catalog
  • Clean Rooms
  • Lakebase (not yet)
  • Agent Bricks (not yet)
  • Enterprise Admin
  • Classic Compute
  • Serverless Compute
  • GPUs (BYO*)

* BYO = Bring Your Own

1 - Ingestion:

Databricks Ingestion refers to the process of bringing data from various sources into the Databricks Lakehouse Platform so it can be processed, analyzed, or stored. It is a critical first step in building data pipelines and analytics workflows on Databricks.

Definition:

Databricks Ingestion is the method of importing structured, semi-structured, or unstructured data from internal or external sources into a Databricks workspace, where it is typically written to Delta Lake tables for reliable, scalable processing and analytics.

Ingestion Methods:

  1. Batch Ingestion
    Load data in scheduled batches (e.g., daily files or table updates)
    • Tools: Autoloader, COPY INTO, Databricks Jobs
    • Sources: S3, ADLS, GCS, on-premise databases, CSV/JSON/Parquet files
  2. Streaming Ingestion
    Real-time or near-real-time ingestion from sources like Kafka or Event Hubs
    • Tool: Structured Streaming in Spark
    • Use case: IoT, logs, real-time analytics
  3. Autoloader (see the sketch after this list)
    • Cloud-native, scalable tool for incrementally ingesting files from cloud storage
    • Supports schema evolution, file notification, and backfill
    • Ideal for both batch and micro-batch pipelines
    • Docs: Autoloader
  4. Partner Connect / Integrations
    Prebuilt connectors for tools like:
    • Fivetran, dbt, Informatica, Airbyte, etc.
    • Simplifies ingesting from SaaS apps and databases
  5. APIs and Custom Code
    Use Spark APIs or Python scripts to read/write data from custom sources
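
For instance, here is a minimal Auto Loader sketch (method 3 above). The paths and target table name are illustrative assumptions, and `spark` is the session already available in a Databricks notebook:

```python
# Minimal Auto Loader sketch: incrementally ingest JSON files from cloud storage
# into a Delta table. Paths and table names are illustrative.
df = (
    spark.readStream
         .format("cloudFiles")                       # Auto Loader source
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/Volumes/main/default/landing/_schema")
         .load("/Volumes/main/default/landing/events")
)

(
    df.writeStream
      .option("checkpointLocation", "/Volumes/main/default/landing/_checkpoint")
      .trigger(availableNow=True)                    # process new files, then stop
      .toTable("main.default.bronze_events")
)
```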

Resources:

2 - Jobs: 

Databricks Jobs are a feature that lets you automate and schedule your data processing workflows. A job in Databricks runs a notebook, JAR, Python script, or SQL query on a defined schedule or in response to an event — making it essential for production data pipelines, ETL, machine learning, and more.

Definition:

A Databricks Job is a reusable, configurable task that executes code (e.g., a notebook or script) on the Databricks platform according to a defined schedule, trigger, or dependency.

Core Capabilities:

  • Task orchestration: Run multiple tasks in a sequence with dependencies (DAG-style)
  • Scheduling: Run jobs on a cron-like schedule or at specific intervals
  • Triggers: Start jobs manually, on a schedule, or via API/webhook
  • Retries and alerts: Configure retry logic and failure notifications (email, webhooks, etc.)
  • Cluster control: Run jobs on existing, new, or serverless clusters
  • Job runs and logs: Monitor job history, logs, and metrics in the UI or via the API
  • Parameterization: Pass parameters to make jobs dynamic (e.g., different input paths/dates)

Supported Job Types:

  • Notebook (most common)
  • JAR (Scala/Java)
  • Python scripts
  • SQL queries or dashboards
  • dbt tasks (via workflows)

Example Use Cases:

  • ETL pipelines: Ingest → Transform → Load
  • Scheduled ML model training and evaluation
  • Daily reporting dashboards
  • Data quality checks
  • Refreshing feature stores or Delta Live Tables
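
As a rough illustration, the sketch below creates a scheduled, single-task notebook job with the Databricks SDK for Python (databricks-sdk). The job name, notebook path, parameter, and cron expression are assumptions for illustration, not a prescribed setup:

```python
# Hedged sketch: define a scheduled notebook job with the Databricks SDK for Python.
# Names, paths, and the schedule are illustrative; auth comes from the environment.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="daily-etl",
    tasks=[
        jobs.Task(
            task_key="ingest_and_transform",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Shared/etl/ingest_and_transform",
                base_parameters={"input_path": "/Volumes/main/default/landing/events"},
            ),
            # No cluster spec here: in workspaces with serverless jobs enabled,
            # the task can run on serverless compute.
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # 02:00 every day
        timezone_id="UTC",
    ),
)
print(f"Created job {job.job_id}")
```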

Resources:

3 - Pipelines: 

In Databricks, pipelines are automated, scalable workflows that help you build, manage, and monitor end-to-end data processing tasks — such as ETL (Extract, Transform, Load), machine learning workflows, or streaming applications. There are two main types of pipelines in Databricks:

  1. Delta Live Tables Pipelines (DLT Pipelines)

Definition:

A Delta Live Tables pipeline is a managed pipeline that lets you define and automate reliable, production-grade data transformations using declarative SQL or Python.

Key Features:

  • Declarative syntax: CREATE LIVE TABLE ... AS SELECT ...
  • Built-in data quality checks (EXPECT statements)
  • Automatic dependency resolution between tables
  • Incremental processing (efficient for large/streaming data)
  • Built-in lineage tracking and monitoring
  • Change data capture (CDC) support

Example Use Case:

  • Create Bronze → Silver → Gold data layers
  • Ensure data quality rules before loading downstream
  • Automatically refresh tables daily/hourly
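
A minimal DLT sketch in Python (the SQL equivalent uses the CREATE LIVE TABLE ... AS SELECT ... syntax noted above). Table names, the source path, and the expectation are illustrative assumptions; this code runs inside a DLT pipeline, where `spark` and the `dlt` module are available:

```python
# Hedged Delta Live Tables sketch: a streaming bronze table plus a silver table
# that enforces a data quality expectation. Names and paths are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested from cloud storage")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/Volumes/main/default/landing/events")
    )

@dlt.table(comment="Cleaned events with a basic quality rule")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def silver_events():
    return dlt.read_stream("bronze_events").withColumn("processed_at", F.current_timestamp())
```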

Resources:

  2. Jobs/Workflows Pipelines

Definition:

A Databricks Jobs pipeline (aka workflow) is a sequence of tasks (e.g., notebooks, scripts, or queries) that execute in order or in parallel, similar to an orchestrated DAG.

This is often what people mean when they refer to "pipelines" in Databricks Jobs.

Key Features:

  • Flexible orchestration of multiple job tasks
  • Dependency management between tasks
  • Event-based or scheduled execution
  • Parameter passing between tasks
  • Integration with Git for CI/CD workflows

Example Use Case:

  • Step 1: Ingest raw data
  • Step 2: Transform and clean
  • Step 3: Train machine learning model
  • Step 4: Publish results to dashboard or database

Resources:

Summary Table:

  • Delta Live Tables (DLT): For declarative data transformations (SQL/PySpark), with auto-optimizations
  • Jobs/Workflows Pipelines: For custom task orchestration (any code/script/tool) in a flexible DAG

4 - Dashboards: 

Databricks Dashboards are visual tools within the Databricks platform that allow users to create, share, and view visualizations of data from notebooks or SQL queries. They are useful for presenting insights to stakeholders without needing them to dive into raw data or code.

What Can You Do with Databricks Dashboards?

  • Visualize data using bar charts, line graphs, pie charts, maps, tables, and more
  • Pin results from SQL queries or notebook cells directly into a dashboard
  • Share dashboards with team members or external users (via links or workspace permissions)
  • Schedule refreshes to keep data visualizations up to date
  • Use dashboards in full-screen mode for live presentations or wall displays

Use Cases:

  • KPI tracking (e.g., revenue, user growth)
  • Data quality monitoring
  • Machine learning model performance tracking
  • Operational metrics (e.g., system logs, ETL status)

Key Features:

  • Live Visuals: Refresh dynamically from SQL queries or notebooks
  • Drag-and-Drop UI: Easy layout customization
  • Access Control: Workspace-level permission management
  • Embedded Visuals: Embeddable in apps or web portals (Enterprise only)

How to Create a Dashboard in Databricks:

  1. Run a Query or Notebook Cell
    Use SQL or Python to analyze your data
  2. Click “+ Add to Dashboard”
    On the output cell, choose the dashboard (new or existing)
  3. Open the Dashboard Editor
    Arrange, resize, and configure visuals
  4. Set Permissions
    Share with others and define view/edit rights
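
For example, a notebook cell like the sketch below produces a result you could pin via "+ Add to Dashboard". The table and column names are illustrative:

```python
# Hypothetical aggregation whose chart/table output could be added to a dashboard.
revenue_by_region = spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM main.default.orders
    GROUP BY region
    ORDER BY revenue DESC
""")
display(revenue_by_region)  # use "+ Add to Dashboard" on this cell's output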

Resources:

5 - Genie: 

Databricks Genie is an AI-powered, no-code interface that lets users—especially business or non-technical teams—interact with data using natural language. Genie translates questions into SQL queries, runs them via Databricks SQL, and returns results along with visualizations.

How Genie “Thinks”:

  • Compound AI system: Genie comprises multiple agents responsible for tasks like planning, natural language parsing, SQL generation, visualization, and quality checks 
  • Guided learning: Optional human feedback—such as thumbs up/down, curated SQL examples, and table metadata—helps Genie refine its accuracy within a "space" (a curated context) 

How to Use Genie:

  1. Set up a Genie space: Link 1–25 tables or views registered in Unity Catalog
  2. Annotate metadata: Add descriptions, synonyms, and example values to tables and columns
  3. Provide guidance: Include example SQL queries and text instructions to shape Genie’s responses
  4. Interact via UI: Users ask questions, review SQL generated, run queries, add visualizations, or refine through feedback—all within the same interface 

Benefits & Considerations:

  • Empowers self-service analytics: Enables business users to get real-time insights without writing SQL 
  • Accuracy depends on metadata: Genie is only as precise as the context provided; quality metadata is essential 
  • Privacy safeguards: Data remains secure—row‑level data isn't visible to the model by default. Genie uses schema metadata and can optionally sample values. Moreover, Databricks with Azure OpenAI ensures that prompts and queries are not stored

Summary:

Genie bridges the gap between technical data assets and business users, offering an intelligent natural-language assistant for querying, charting, and exploring data—powered by SQL under the hood and enhanced with ongoing human-guided refinement.

Here’s a streamlined setup guide for Databricks AI/BI Genie, including environment requirements, step-by-step instructions, and best practices:

  1. Prerequisites & Requirements:
  • Unity Catalog: Store your datasets as tables or views within Unity Catalog (up to 25 assets per Genie space)
  • Compute Resource: Use a Databricks SQL Pro or serverless SQL warehouse and ensure you have CAN USE rights
  • Permissions (see the sketch after this list):
    • SQL entitlement for creator/editor roles
    • CAN USE on the warehouse
    • SELECT on relevant catalog tables
    • CAN EDIT on the Genie space to set it up
  2. Create and Configure a Genie Space:
  • Go to the Genie sidebar and click New
  • Select up to 25 related tables/views from Unity Catalog
  • Build a Knowledge Store: add table/column descriptions, synonyms, and value sampling to improve query responses
  • Provide clear custom instructions, example SQL queries, and optional metric views to align Genie with business logic
  3. Best Practices for Curation:
  • Start small & focused: use 5–10 tables with <50 columns and clearly connected relationships
  • Annotate thoroughly: add descriptive column/table metadata and key definitions (PKs/FKs) for better model understanding
  • Iterate based on usage: monitor Genie interactions, refine instructions and example queries over time
  4. Publish & Share:
  • Share the space with colleagues, ensuring users have CAN RUN (space), CAN USE (warehouse), and SELECT (tables) rights. New chat contexts are captured in Genie; you can also duplicate spaces using Clone.
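
As a rough sketch of the table-level grants listed above (the group name and object names are assumptions), the SQL could look like this when run from a notebook; CAN USE on the SQL warehouse and CAN RUN on the space itself are granted in the UI:

```python
# Hedged sketch of Unity Catalog privileges Genie space users typically need.
# Principal and object names are illustrative.
for stmt in [
    "GRANT USE CATALOG ON CATALOG main TO `sales-analysts`",
    "GRANT USE SCHEMA ON SCHEMA main.sales TO `sales-analysts`",
    "GRANT SELECT ON TABLE main.sales.orders TO `sales-analysts`",
    "GRANT SELECT ON TABLE main.sales.customers TO `sales-analysts`",
]:
    spark.sql(stmt)
```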

Summary Workflow:

  • Register tables/views in Unity Catalog
  • Create a Genie space and link compute resources
  • Curate via metadata, instructions, and example queries
  • Monitor & refine based on usage
  • Share with appropriate permissions

Resources:

6 - Semantic Search:

Databricks Semantic Search, part of the Mosaic AI Vector Search feature, enables hybrid semantic+keyword search on text data stored in Delta tables. It uses vector embeddings plus traditional keyword search, combining results via Reciprocal Rank Fusion to deliver more meaningful and context-aware retrieval.

Core Features:

  • Hybrid keyword-semantic search: Merges embedding-based similarity with relevance scoring (BM25) 
  • Embeddings storage: Supports auto-generated or precomputed embeddings synced in Delta tables
  • Fast similarity retrieval: Powered by the HNSW algorithm, scaling to millions of vectors 
  • REST/SDK access: Standard and storage‑optimized endpoints via API (Python/REST) 
  • Index sync: Automatic update support for Delta tables with new or changed content 
  • Filtering & ACLs: Allows metadata filtering and permission control on search results 

How It Works (Example Flow):

  1. Create a vector index from a Delta table (with text + embeddings)
  2. Query: Input either text or an embedding + optional metadata filters
  3. Search: Runtime mixes keyword (BM25) + semantic (vector similarity) scores via Reciprocal Rank Fusion 
  4. Result: Returns ranked documents with hybrid relevance—ideal for RAG, BI, knowledge discovery
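
A hedged sketch of step 2 and 3 using the databricks-vectorsearch Python client; the endpoint name, index name, and column names are illustrative assumptions:

```python
# Hedged sketch: hybrid (keyword + semantic) query against a Vector Search index.
# Endpoint, index, and column names are illustrative.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="shared-vs-endpoint",
    index_name="main.default.docs_index",
)

results = index.similarity_search(
    query_text="How do I configure incremental file ingestion?",
    columns=["doc_id", "title", "content"],
    num_results=5,
    query_type="HYBRID",   # mix BM25 keyword scores with vector similarity
)
print(results["result"]["data_array"])
```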

Summary:

Databricks Semantic Search (via Mosaic AI Vector Search) provides enterprise-grade hybrid search within Delta tables—automatically syncing embeddings, managing indexing, and combining semantic and keyword matching to surface more relevant results. It's an essential feature for building retrieval-augmented systems, knowledge discovery tools, or intelligent search experiences.

Resources:

  • Mosaic AI Vector Search overview (Databricks/Azure docs) – details on setup, HNSW, embeddings, endpoint types 
  • Databricks blog note on semantic search improvements in workspace assets (premium preview) 

7 - Model Serving:

Databricks Model Serving is a fully managed, production-grade system for deploying and hosting machine learning models as REST APIs. It enables users to serve models built in Databricks or imported from outside, monitor performance, and integrate seamlessly with applications or inference workflows like Retrieval-Augmented Generation (RAG), dashboards, or batch jobs.

Key Features:

  • Managed infrastructure: Databricks handles autoscaling, containerization, and endpoint management
  • Real-time REST APIs: Expose ML models as secure, scalable HTTP endpoints
  • Native Unity Catalog integration: Models are versioned and managed in Unity Catalog
  • Support for multiple frameworks: Supports MLflow, scikit-learn, PyTorch, XGBoost, TensorFlow, and Hugging Face
  • GPU and LLM support: Ideal for serving LLMs, foundation models, or custom models on GPUs
  • Serverless (recommended): Serverless Model Serving endpoints offer low-latency, autoscaling compute without infrastructure management

Common Use Cases:

  • Powering ML-backed applications (e.g., fraud detection, recommendations)
  • Enabling chatbot LLMs or RAG pipelines
  • Integrating models into BI dashboards or web services
  • Batch scoring or real-time predictions
  • Embedding models in semantic search use cases

How It Works:

  1. Register a model in Unity Catalog using MLflow or import a model
  2. Create a serving endpoint from the Databricks UI or CLI
  3. Call the endpoint via REST API to get predictions
  4. Monitor performance (latency, throughput, errors) directly in the UI
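
Step 3 might look like the following sketch, which posts records to an endpoint's invocations URL; the workspace host, token, endpoint name, and input columns are assumptions:

```python
# Hedged sketch: score records against a serving endpoint over REST.
# Host, token, endpoint name, and feature columns are illustrative.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

response = requests.post(
    f"{host}/serving-endpoints/churn-model/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"tenure_months": 12, "monthly_charges": 70.5}]},
    timeout=60,
)
response.raise_for_status()
print(response.json())   # typically {"predictions": [...]}
```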

Resources:

8 - Model Evaluation:

Databricks Model Evaluation is a feature within the Databricks Machine Learning environment that helps data scientists and ML engineers assess the performance and quality of machine learning models. It provides a standardized, automated way to compute, log, and visualize evaluation metrics, making it easier to compare models, validate behavior, and ensure models meet required standards before deployment.

Key Capabilities:

  1. Automatic Metric Logging
    • Evaluates models using standard classification and regression metrics (e.g., accuracy, F1, precision, R², RMSE)
    • Auto-logs metrics using MLflow for easy comparison
  2. Visualizations
    • Confusion matrices, ROC curves, precision-recall curves, residual plots, etc.
    • Useful for understanding model behavior across different thresholds or segments
  3. Custom Metrics Support
    • You can log your own domain-specific metrics in addition to built-in ones
  4. Multi-model Comparison
    • Evaluate and compare different model versions from the MLflow Model Registry
  5. Dataset Tracking
    • Associates evaluation results with the specific dataset used, improving reproducibility
  6. Automated Evaluation UI
    • The Databricks UI displays model evaluation results under the "Experiments" and "Model Registry" tabs
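
A minimal sketch with mlflow.evaluate(); the model URI, the tiny evaluation DataFrame, and the label column are illustrative assumptions:

```python
# Hedged sketch: evaluate a registered classifier on a labeled dataset with MLflow.
# Model URI, dataset, and label column are illustrative.
import mlflow
import pandas as pd

eval_df = pd.DataFrame({
    "tenure_months":   [3, 24, 48, 6],
    "monthly_charges": [80.0, 55.5, 42.0, 99.9],
    "churned":         [1, 0, 0, 1],      # ground-truth labels
})

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/main.default.churn_model/1",  # Unity Catalog model version URI
        data=eval_df,
        targets="churned",
        model_type="classifier",
    )

print(results.metrics)   # accuracy, F1, precision, recall, etc., also logged to the run
```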

Resources:

9 - Agents:

Databricks Agents are a new capability introduced by Databricks as part of their AI and machine learning stack, designed to enable natural language interaction with data and workflows. They are essentially AI-powered assistants that can interpret user input (typically in plain English) and then act on it by writing code, querying data, or triggering workflows using the underlying data infrastructure in Databricks.

What Are Databricks Agents?

  • AI-driven applications that interact with data and tools within the Databricks platform
  • Built on top of Databricks Foundation Model APIs and Vector Search
  • Capable of performing tasks like querying data, building visualizations, running pipelines, or even generating new code—all from natural language instructions

Key Features:

  • Natural Language to SQL: Users can ask data questions in plain English, and the agent generates SQL or PySpark to retrieve the answer
  • Integration with Lakehouse: Agents have access to Unity Catalog, Delta tables, ML models, and notebooks
  • RAG-powered Intelligence: They use Retrieval-Augmented Generation (RAG) by pulling in context from documentation, notebooks, or other resources
  • Extensible Workflows: You can define tools (functions, APIs, SQL snippets) the agent can use to complete tasks
  • Multi-modal Interaction: Interact via chat interfaces, dashboards, or embedded tools

Example Use Case:

Imagine a user types: “Show me the revenue trend for Q2 2024 for our top 5 performing regions.”

A Databricks Agent could:

  • Parse the question
  • Generate SQL or PySpark to retrieve the necessary data
  • Visualize the result as a chart
  • Optionally explain the trend or compare with previous quarters

Architecture Overview:

Databricks Agents are built using:

  • Databricks Model Serving (to host and run LLMs)
  • Vector Search (to provide domain knowledge context)
  • Function Calling APIs (to execute workflows or queries)
  • Unity Catalog (for governed data access)
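
As a loose illustration of the function-calling piece only (not the full Agent Framework), the sketch below sends a tool definition to a foundation-model endpoint through its OpenAI-compatible API. The host, token, model name, and tool are assumptions for illustration:

```python
# Hedged sketch: function/tool calling against a Databricks foundation-model
# endpoint via the OpenAI-compatible API. Host, token, model, and tool are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
    api_key="<databricks-token>",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[{"role": "user",
               "content": "Show me the revenue trend for Q2 2024 for our top 5 regions."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "run_sql",          # hypothetical tool the agent could call
            "description": "Run a SQL query against the lakehouse and return rows",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
)
print(response.choices[0].message)  # either an answer or a tool call to execute
```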

Resources:

10 - Unity Catalog:

Databricks Unity Catalog is a unified governance solution for all data and AI assets in the Databricks Lakehouse Platform. It provides fine-grained access control, data lineage, and centralized metadata management across workspaces, data, and compute environments—ensuring consistent security and compliance for your entire data estate.

Key Concepts of Unity Catalog:

  • Catalog: A top-level container for schemas (databases); think of it as a namespace for organizing data assets
  • Schema (Database): A container within a catalog that holds tables, views, and functions
  • Tables & Views: Structured data assets stored in Delta Lake or other formats
  • Lineage: Automatically tracks how data flows across queries, jobs, and dashboards
  • Access Control: Role-based and attribute-based access policies using ANSI SQL GRANT and REVOKE (see the sketch after this list)
  • Tags & Classifications: Help label sensitive data (e.g., PII) and enforce compliance
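
A minimal sketch of the three-level namespace, a grant, and a column mask; all object and principal names are illustrative assumptions:

```python
# Hedged sketch: create catalog/schema/table, grant read access, and attach a
# column mask. All names are illustrative.
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales.customers (
        customer_id BIGINT, email STRING, region STRING
    )
""")
spark.sql("GRANT SELECT ON TABLE analytics.sales.customers TO `data-readers`")

# Column-level masking: only members of 'admins' see the raw email address
spark.sql("""
    CREATE OR REPLACE FUNCTION analytics.sales.email_mask(email STRING)
    RETURNS STRING
    RETURN CASE WHEN is_account_group_member('admins') THEN email ELSE '***REDACTED***' END
""")
spark.sql("""
    ALTER TABLE analytics.sales.customers
    ALTER COLUMN email SET MASK analytics.sales.email_mask
""")
```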

Why Use Unity Catalog?

  • Centralized governance across all workspaces and regions
  • Secure and fine-grained access at the table, column, row, and function levels
  • Auditability with built-in support for data lineage and access logs
  • Integration with Identity Providers like Azure AD, Okta, or SCIM
  • Support for Multiple Data Sources (Delta Lake, external tables, ML models, notebooks)

Security & Compliance Features:

  • Column-level masking (e.g., mask emails or credit card numbers)
  • Dynamic data access policies (based on user attributes)
  • Data classification and tagging (for PII, confidential info)
  • Logging and audit trails

Supported Assets:

  • Delta tables & views
  • Machine learning models
  • Functions and stored procedures
  • Files and folders in cloud storage (via Volume support)
  • Notebooks (metadata only)
  • Dashboards (metadata only)

Multi-Workspace & Cross-Cloud:

  • Works across multiple Databricks workspaces
  • Supports multi-cloud environments (AWS, Azure, GCP)
  • Ensures consistent policies and roles across all environments

Resources:

Databricks Documentation: Unity Catalog Overview

11 - Serverless Compute:

Databricks Serverless Compute is a fully managed compute option that automatically provisions and scales compute resources for SQL, notebooks, dashboards, and jobs—without requiring users to manage clusters. It simplifies infrastructure management, reduces cost through automatic scaling, and improves performance by using optimized hardware.

Key Features of Databricks Serverless Compute:

  • No cluster management: Users don’t need to create, configure, or manage clusters manually
  • Auto-scaling: Automatically scales resources up or down based on the workload
  • Fast startup: Compute resources are ready in seconds—ideal for ad hoc queries or interactive notebooks
  • Optimized performance: Databricks manages the compute infrastructure to ensure optimal performance, often using faster hardware and tuning
  • Pay-per-use pricing: You pay only for what you use, down to the second, making it cost-efficient

Use Cases:

  • Serverless SQL Warehouses: For analysts running BI tools or SQL queries
  • Notebook Jobs: For data scientists or engineers running ML or data processing pipelines
  • Dashboards: For live, interactive dashboards with fast load times
  • Ad hoc workloads: Great for unpredictable workloads or users who want to focus only on their work, not infrastructure
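
As a small sketch of the first use case, the databricks-sql-connector can run queries against a serverless SQL warehouse from Python; the hostname, HTTP path, and token are assumptions you would take from your warehouse's connection details:

```python
# Hedged sketch: query a serverless SQL warehouse with the Databricks SQL connector.
# Connection details are illustrative placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<your-workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT current_timestamp() AS now")
        print(cur.fetchall())
```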

Benefits:

  • Simplicity: No DevOps overhead; ideal for business analysts and data teams
  • Speed: Instant compute provisioning; reduces latency for short queries or jobs
  • Efficiency: More cost-effective due to auto-scaling and granular billing
  • Security: Integrates with Unity Catalog and platform-wide access controls

Security and Governance:

  • Integrated with Unity Catalog for fine-grained access control
  • Secure data access through private link (if configured)
  • Audit logs of compute and query execution

Resources:

 

We hope that you find this information useful in your transition from Databricks Community Edition to Free Edition and that you get to fully leverage the expanded capabilities offered by Databricks Free Edition.

 
