Jenni
Databricks Employee


 

Features (Community Edition vs. Free Edition):

  • Notebooks
  • MLflow
  • Ingestion
  • Jobs
  • Pipelines
  • Dashboards
  • Genie
  • Semantic Search
  • Model Serving
  • Model Evaluation
  • Agents
  • Unity Catalog
  • Clean Rooms
  • Lakebase (not yet)
  • Agent Bricks (not yet)
  • Enterprise Admin
  • Classic Compute
  • Serverless Compute
  • GPUs (BYO*)

* BYO = Bring Your Own

1 - Ingestion:

Databricks Ingestion refers to the process of bringing data from various sources into the Databricks Lakehouse Platform so it can be processed, analyzed, or stored. It is a critical first step in building data pipelines and analytics workflows on Databricks.

Definition:

Databricks Ingestion is the method of importing structured, semi-structured, or unstructured data from internal or external sources into a Databricks workspace, where it is typically written to Delta Lake tables for reliable, scalable processing and analytics.

Ingestion Methods:

  1. Batch Ingestion
    Load data in scheduled batches (e.g., daily files or table updates)
    • Tools: Autoloader, COPY INTO, Databricks Jobs
    • Sources: S3, ADLS, GCS, on-premise databases, CSV/JSON/Parquet files
  2. Streaming Ingestion
    Real-time or near-real-time ingestion from sources like Kafka or Event Hubs
    • Tool: Structured Streaming in Spark
    • Use case: IoT, logs, real-time analytics
  3. Autoloader (see the sketch after this list)
    • Cloud-native, scalable tool for incrementally ingesting files from cloud storage
    • Supports schema evolution, file notification, and backfill
    • Ideal for both batch and micro-batch pipelines
    • Docs: Autoloader
  4. Partner Connect / Integrations
    Prebuilt connectors for tools like:
    • Fivetran, dbt, Informatica, Airbyte, etc.
    • Simplifies ingesting from SaaS apps and databases
  5. APIs and Custom Code
    Use Spark APIs or Python scripts to read/write data from custom sources
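
For instance, here is a minimal Auto Loader sketch (method 3 above). The paths and target table name are illustrative assumptions, and `spark` is the session already available in a Databricks notebook:

```python
# Minimal Auto Loader sketch: incrementally ingest JSON files from cloud storage
# into a Delta table. Paths and table names are illustrative.
df = (
    spark.readStream
         .format("cloudFiles")                       # Auto Loader source
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/Volumes/main/default/landing/_schema")
         .load("/Volumes/main/default/landing/events")
)

(
    df.writeStream
      .option("checkpointLocation", "/Volumes/main/default/landing/_checkpoint")
      .trigger(availableNow=True)                    # process new files, then stop
      .toTable("main.default.bronze_events")
)
```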

Resources:

2 - Jobs: 

Databricks Jobs are a feature that lets you automate and schedule your data processing workflows. A job in Databricks runs a notebook, JAR, Python script, or SQL query on a defined schedule or in response to an event — making it essential for production data pipelines, ETL, machine learning, and more.

Definition:

A Databricks Job is a reusable, configurable task that executes code (e.g., a notebook or script) on the Databricks platform according to a defined schedule, trigger, or dependency.

Core Capabilities:

  • Task orchestration: Run multiple tasks in a sequence with dependencies (DAG-style)
  • Scheduling: Run jobs on a cron-like schedule or at specific intervals
  • Triggers: Start jobs manually, on a schedule, or via API/webhook
  • Retries and alerts: Configure retry logic and failure notifications (email, webhooks, etc.)
  • Cluster control: Run jobs on existing, new, or serverless clusters
  • Job runs and logs: Monitor job history, logs, and metrics in the UI or via the API
  • Parameterization: Pass parameters to make jobs dynamic (e.g., different input paths/dates)

Supported Job Types:

  • Notebook (most common)
  • JAR (Scala/Java)
  • Python scripts
  • SQL queries or dashboards
  • dbt tasks (via workflows)

Example Use Cases:

  • ETL pipelines: Ingest → Transform → Load
  • Scheduled ML model training and evaluation
  • Daily reporting dashboards
  • Data quality checks
  • Refreshing feature stores or Delta Live Tables
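
As a rough illustration, the sketch below creates a scheduled, single-task notebook job with the Databricks SDK for Python (databricks-sdk). The job name, notebook path, parameter, and cron expression are assumptions for illustration, not a prescribed setup:

```python
# Hedged sketch: define a scheduled notebook job with the Databricks SDK for Python.
# Names, paths, and the schedule are illustrative; auth comes from the environment.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="daily-etl",
    tasks=[
        jobs.Task(
            task_key="ingest_and_transform",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Shared/etl/ingest_and_transform",
                base_parameters={"input_path": "/Volumes/main/default/landing/events"},
            ),
            # No cluster spec here: in workspaces with serverless jobs enabled,
            # the task can run on serverless compute.
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # 02:00 every day
        timezone_id="UTC",
    ),
)
print(f"Created job {job.job_id}")
```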

Resources:

3 - Pipelines: 

In Databricks, pipelines are automated, scalable workflows that help you build, manage, and monitor end-to-end data processing tasks — such as ETL (Extract, Transform, Load), machine learning workflows, or streaming applications. There are two main types of pipelines in Databricks:

  1. Delta Live Tables Pipelines (DLT Pipelines)

Definition:

A Delta Live Tables pipeline is a managed pipeline that lets you define and automate reliable, production-grade data transformations using declarative SQL or Python.

Key Features:

  • Declarative syntax: CREATE LIVE TABLE ... AS SELECT ...
  • Built-in data quality checks (EXPECT statements)
  • Automatic dependency resolution between tables
  • Incremental processing (efficient for large/streaming data)
  • Built-in lineage tracking and monitoring
  • Change data capture (CDC) support

Example Use Case:

  • Create Bronze → Silver → Gold data layers
  • Ensure data quality rules before loading downstream
  • Automatically refresh tables daily/hourly
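
A minimal DLT sketch in Python (the SQL equivalent uses the CREATE LIVE TABLE ... AS SELECT ... syntax noted above). Table names, the source path, and the expectation are illustrative assumptions; this code runs inside a DLT pipeline, where `spark` and the `dlt` module are available:

```python
# Hedged Delta Live Tables sketch: a streaming bronze table plus a silver table
# that enforces a data quality expectation. Names and paths are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested from cloud storage")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/Volumes/main/default/landing/events")
    )

@dlt.table(comment="Cleaned events with a basic quality rule")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def silver_events():
    return dlt.read_stream("bronze_events").withColumn("processed_at", F.current_timestamp())
```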

Resources:

  2. Jobs/Workflows Pipelines

Definition:

A Databricks Jobs pipeline (aka workflow) is a sequence of tasks (e.g., notebooks, scripts, or queries) that execute in order or in parallel, similar to an orchestrated DAG.

This is often what people mean when they refer to "pipelines" in Databricks Jobs.

Key Features:

  • Flexible orchestration of multiple job tasks
  • Dependency management between tasks
  • Event-based or scheduled execution
  • Parameter passing between tasks
  • Integration with Git for CI/CD workflows

Example Use Case:

  • Step 1: Ingest raw data
  • Step 2: Transform and clean
  • Step 3: Train machine learning model
  • Step 4: Publish results to dashboard or database

Resources:

Summary Table:

  • Delta Live Tables (DLT): For declarative data transformations (SQL/PySpark), with auto-optimizations
  • Jobs/Workflows Pipelines: For custom task orchestration (any code/script/tool) in a flexible DAG

4 - Dashboards: 

Databricks Dashboards are visual tools within the Databricks platform that allow users to create, share, and view visualizations of data from notebooks or SQL queries. They are useful for presenting insights to stakeholders without needing them to dive into raw data or code.

What Can You Do with Databricks Dashboards?

  • Visualize data using bar charts, line graphs, pie charts, maps, tables, and more
  • Pin results from SQL queries or notebook cells directly into a dashboard
  • Share dashboards with team members or external users (via links or workspace permissions)
  • Schedule refreshes to keep data visualizations up to date
  • Use dashboards in full-screen mode for live presentations or wall displays

Use Cases:

  • KPI tracking (e.g., revenue, user growth)
  • Data quality monitoring
  • Machine learning model performance tracking
  • Operational metrics (e.g., system logs, ETL status)

Key Features:

  • Live Visuals: Refresh dynamically from SQL queries or notebooks
  • Drag-and-Drop UI: Easy layout customization
  • Access Control: Workspace-level permission management
  • Embedded Visuals: Embeddable in apps or web portals (Enterprise only)

How to Create a Dashboard in Databricks:

  1. Run a Query or Notebook Cell
    Use SQL or Python to analyze your data
  2. Click “+ Add to Dashboard”
    On the output cell, choose the dashboard (new or existing)
  3. Open the Dashboard Editor
    Arrange, resize, and configure visuals
  4. Set Permissions
    Share with others and define view/edit rights
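
For example, a notebook cell like the sketch below produces a result you could pin via "+ Add to Dashboard". The table and column names are illustrative:

```python
# Hypothetical aggregation whose chart/table output could be added to a dashboard.
revenue_by_region = spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM main.default.orders
    GROUP BY region
    ORDER BY revenue DESC
""")
display(revenue_by_region)  # use "+ Add to Dashboard" on this cell's output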

Resources:

5 - Genie: 

Databricks Genie is an AI-powered, no-code interface that lets users—especially business or non-technical teams—interact with data using natural language. Genie translates questions into SQL queries, runs them via Databricks SQL, and returns results along with visualizations.

How Genie “Thinks”:

  • Compound AI system: Genie comprises multiple agents responsible for tasks like planning, natural language parsing, SQL generation, visualization, and quality checks 
  • Guided learning: Optional human feedback—such as thumbs up/down, curated SQL examples, and table metadata—helps Genie refine its accuracy within a "space" (a curated context) 

How to Use Genie:

  1. Set up a Genie space: Link 1–25 tables or views registered in Unity Catalog
  2. Annotate metadata: Add descriptions, synonyms, and example values to tables and columns
  3. Provide guidance: Include example SQL queries and text instructions to shape Genie’s responses
  4. Interact via UI: Users ask questions, review SQL generated, run queries, add visualizations, or refine through feedback—all within the same interface 

Benefits & Considerations:

  • Empowers self-service analytics: Enables business users to get real-time insights without writing SQL 
  • Accuracy depends on metadata: Genie is only as precise as the context provided; quality metadata is essential 
  • Privacy safeguards: Data remains secure—row‑level data isn't visible to the model by default. Genie uses schema metadata and can optionally sample values. Moreover, Databricks with Azure OpenAI ensures that prompts and queries are not stored

Summary:

Genie bridges the gap between technical data assets and business users, offering an intelligent natural-language assistant for querying, charting, and exploring data—powered by SQL under the hood and enhanced with ongoing human-guided refinement.

Here’s a streamlined setup guide for Databricks AI/BI Genie, including environment requirements, step-by-step instructions, and best practices:

  1. Prerequisites & Requirements:
  • Unity Catalog: Store your datasets as tables or views within Unity Catalog (up to 25 assets per Genie space)
  • Compute Resource: Use a Databricks SQL Pro or serverless SQL warehouse and ensure you have CAN USE rights
  • Permissions (see the sketch after this list):
    • SQL entitlement for creator/editor roles
    • CAN USE on the warehouse
    • SELECT on relevant catalog tables
    • CAN EDIT on the Genie space to set it up
  2. Create and Configure a Genie Space:
  • Go to the Genie sidebar and click New
  • Select up to 25 related tables/views from Unity Catalog
  • Build a Knowledge Store: add table/column descriptions, synonyms, and value sampling to improve query responses
  • Provide clear custom instructions, example SQL queries, and optional metric views to align Genie with business logic
  3. Best Practices for Curation:
  • Start small & focused: use 5–10 tables with <50 columns and clearly connected relationships
  • Annotate thoroughly: add descriptive column/table metadata and key definitions (PKs/FKs) for better model understanding
  • Iterate based on usage: monitor Genie interactions, refine instructions and example queries over time
  4. Publish & Share:
  • Share the space with colleagues, ensuring users have CAN RUN (space), CAN USE (warehouse), and SELECT (tables) rights. New chat contexts are captured in Genie; you can also duplicate spaces using Clone.
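
As a rough sketch of the table-level grants listed above (the group name and object names are assumptions), the SQL could look like this when run from a notebook; CAN USE on the SQL warehouse and CAN RUN on the space itself are granted in the UI:

```python
# Hedged sketch of Unity Catalog privileges Genie space users typically need.
# Principal and object names are illustrative.
for stmt in [
    "GRANT USE CATALOG ON CATALOG main TO `sales-analysts`",
    "GRANT USE SCHEMA ON SCHEMA main.sales TO `sales-analysts`",
    "GRANT SELECT ON TABLE main.sales.orders TO `sales-analysts`",
    "GRANT SELECT ON TABLE main.sales.customers TO `sales-analysts`",
]:
    spark.sql(stmt)
```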

Summary Workflow:

  • Register tables/views in Unity Catalog
  • Create a Genie space and link compute resources
  • Curate via metadata, instructions, and example queries
  • Monitor & refine based on usage
  • Share with appropriate permissions

Resources:

6 - Semantic Search:

Databricks Semantic Search, part of the Mosaic AI Vector Search feature, enables hybrid semantic+keyword search on text data stored in Delta tables. It uses vector embeddings plus traditional keyword search, combining results via Reciprocal Rank Fusion to deliver more meaningful and context-aware retrieval.

Core Features:

  • Hybrid keyword-semantic search: Merges embedding-based similarity with relevance scoring (BM25) 
  • Embeddings storage: Supports auto-generated or precomputed embeddings synced in Delta tables
  • Fast similarity retrieval: Powered by the HNSW algorithm, scaling to millions of vectors 
  • REST/SDK access: Standard and storage‑optimized endpoints via API (Python/REST) 
  • Index sync: Automatic update support for Delta tables with new or changed content 
  • Filtering & ACLs: Allows metadata filtering and permission control on search results 

How It Works (Example Flow):

  1. Create a vector index from a Delta table (with text + embeddings)
  2. Query: Input either text or an embedding + optional metadata filters
  3. Search: Runtime mixes keyword (BM25) + semantic (vector similarity) scores via Reciprocal Rank Fusion 
  4. Result: Returns ranked documents with hybrid relevance—ideal for RAG, BI, knowledge discovery
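
A hedged sketch of step 2 and 3 using the databricks-vectorsearch Python client; the endpoint name, index name, and column names are illustrative assumptions:

```python
# Hedged sketch: hybrid (keyword + semantic) query against a Vector Search index.
# Endpoint, index, and column names are illustrative.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="shared-vs-endpoint",
    index_name="main.default.docs_index",
)

results = index.similarity_search(
    query_text="How do I configure incremental file ingestion?",
    columns=["doc_id", "title", "content"],
    num_results=5,
    query_type="HYBRID",   # mix BM25 keyword scores with vector similarity
)
print(results["result"]["data_array"])
```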

Summary:

Databricks Semantic Search (via Mosaic AI Vector Search) provides enterprise-grade hybrid search within Delta tables—automatically syncing embeddings, managing indexing, and combining semantic and keyword matching to surface more relevant results. It's an essential feature for building retrieval-augmented systems, knowledge discovery tools, or intelligent search experiences.

Resources:

  • Mosaic AI Vector Search overview (Databricks/Azure docs) – details on setup, HNSW, embeddings, endpoint types 
  • Databricks blog note on semantic search improvements in workspace assets (premium preview) 

7 - Model Serving:

Databricks Model Serving is a fully managed, production-grade system for deploying and hosting machine learning models as REST APIs. It enables users to serve models built in Databricks or imported from outside, monitor performance, and integrate seamlessly with applications or inference workflows like Retrieval-Augmented Generation (RAG), dashboards, or batch jobs.

Key Features:

  • Managed infrastructure: Databricks handles autoscaling, containerization, and endpoint management
  • Real-time REST APIs: Expose ML models as secure, scalable HTTP endpoints
  • Native Unity Catalog integration: Models are versioned and managed in Unity Catalog
  • Support for multiple frameworks: Supports MLflow, scikit-learn, PyTorch, XGBoost, TensorFlow, and Hugging Face
  • GPU and LLM support: Ideal for serving LLMs, foundation models, or custom models on GPUs
  • Serverless (recommended): Serverless Model Serving endpoints offer low-latency, autoscaling compute without infrastructure management

Common Use Cases:

  • Powering ML-backed applications (e.g., fraud detection, recommendations)
  • Enabling chatbot LLMs or RAG pipelines
  • Integrating models into BI dashboards or web services
  • Batch scoring or real-time predictions
  • Embedding models in semantic search use cases

How It Works:

  1. Register a model in Unity Catalog using MLflow or import a model
  2. Create a serving endpoint from the Databricks UI or CLI
  3. Call the endpoint via REST API to get predictions
  4. Monitor performance (latency, throughput, errors) directly in the UI
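
Step 3 might look like the following sketch, which posts records to an endpoint's invocations URL; the workspace host, token, endpoint name, and input columns are assumptions:

```python
# Hedged sketch: score records against a serving endpoint over REST.
# Host, token, endpoint name, and feature columns are illustrative.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

response = requests.post(
    f"{host}/serving-endpoints/churn-model/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"tenure_months": 12, "monthly_charges": 70.5}]},
    timeout=60,
)
response.raise_for_status()
print(response.json())   # typically {"predictions": [...]}
```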

Resources:

8 - Model Evaluation:

Databricks Model Evaluation is a feature within the Databricks Machine Learning environment that helps data scientists and ML engineers assess the performance and quality of machine learning models. It provides a standardized, automated way to compute, log, and visualize evaluation metrics, making it easier to compare models, validate behavior, and ensure models meet required standards before deployment.

Key Capabilities:

  1. Automatic Metric Logging
    • Evaluates models using standard classification and regression metrics (e.g., accuracy, F1, precision, R², RMSE)
    • Auto-logs metrics using MLflow for easy comparison
  2. Visualizations
    • Confusion matrices, ROC curves, precision-recall curves, residual plots, etc.
    • Useful for understanding model behavior across different thresholds or segments
  3. Custom Metrics Support
    • You can log your own domain-specific metrics in addition to built-in ones
  4. Multi-model Comparison
    • Evaluate and compare different model versions from the MLflow Model Registry
  5. Dataset Tracking
    • Associates evaluation results with the specific dataset used, improving reproducibility
  6. Automated Evaluation UI
    • The Databricks UI displays model evaluation results under the "Experiments" and "Model Registry" tabs
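
A minimal sketch with mlflow.evaluate(); the model URI, the tiny evaluation DataFrame, and the label column are illustrative assumptions:

```python
# Hedged sketch: evaluate a registered classifier on a labeled dataset with MLflow.
# Model URI, dataset, and label column are illustrative.
import mlflow
import pandas as pd

eval_df = pd.DataFrame({
    "tenure_months":   [3, 24, 48, 6],
    "monthly_charges": [80.0, 55.5, 42.0, 99.9],
    "churned":         [1, 0, 0, 1],      # ground-truth labels
})

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/main.default.churn_model/1",  # Unity Catalog model version URI
        data=eval_df,
        targets="churned",
        model_type="classifier",
    )

print(results.metrics)   # accuracy, F1, precision, recall, etc., also logged to the run
```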

Resources:

9 - Agents:

Databricks Agents are a new capability introduced by Databricks as part of their AI and machine learning stack, designed to enable natural language interaction with data and workflows. They are essentially AI-powered assistants that can interpret user input (typically in plain English) and then act on it by writing code, querying data, or triggering workflows using the underlying data infrastructure in Databricks.

What Are Databricks Agents?

  • AI-driven applications that interact with data and tools within the Databricks platform
  • Built on top of Databricks Foundation Model APIs and Vector Search
  • Capable of performing tasks like querying data, building visualizations, running pipelines, or even generating new code—all from natural language instructions

Key Features:

  • Natural Language to SQL: Users can ask data questions in plain English, and the agent generates SQL or PySpark to retrieve the answer
  • Integration with Lakehouse: Agents have access to Unity Catalog, Delta tables, ML models, and notebooks
  • RAG-powered Intelligence: They use Retrieval-Augmented Generation (RAG) by pulling in context from documentation, notebooks, or other resources
  • Extensible Workflows: You can define tools (functions, APIs, SQL snippets) the agent can use to complete tasks
  • Multi-modal Interaction: Interact via chat interfaces, dashboards, or embedded tools

Example Use Case:

Imagine a user types: “Show me the revenue trend for Q2 2024 for our top 5 performing regions.”

A Databricks Agent could:

  • Parse the question
  • Generate SQL or PySpark to retrieve the necessary data
  • Visualize the result as a chart
  • Optionally explain the trend or compare with previous quarters

Architecture Overview:

Databricks Agents are built using:

  • Databricks Model Serving (to host and run LLMs)
  • Vector Search (to provide domain knowledge context)
  • Function Calling APIs (to execute workflows or queries)
  • Unity Catalog (for governed data access)
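
As a loose illustration of the function-calling piece only (not the full Agent Framework), the sketch below sends a tool definition to a foundation-model endpoint through its OpenAI-compatible API. The host, token, model name, and tool are assumptions for illustration:

```python
# Hedged sketch: function/tool calling against a Databricks foundation-model
# endpoint via the OpenAI-compatible API. Host, token, model, and tool are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
    api_key="<databricks-token>",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[{"role": "user",
               "content": "Show me the revenue trend for Q2 2024 for our top 5 regions."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "run_sql",          # hypothetical tool the agent could call
            "description": "Run a SQL query against the lakehouse and return rows",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
)
print(response.choices[0].message)  # either an answer or a tool call to execute
```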

Resources:

10 - Unity Catalog:

Databricks Unity Catalog is a unified governance solution for all data and AI assets in the Databricks Lakehouse Platform. It provides fine-grained access control, data lineage, and centralized metadata management across workspaces, data, and compute environments—ensuring consistent security and compliance for your entire data estate.

Key Concepts of Unity Catalog:

  • Catalog: A top-level container for schemas (databases); think of it as a namespace for organizing data assets
  • Schema (Database): A container within a catalog that holds tables, views, and functions
  • Tables & Views: Structured data assets stored in Delta Lake or other formats
  • Lineage: Automatically tracks how data flows across queries, jobs, and dashboards
  • Access Control: Role-based and attribute-based access policies using ANSI SQL GRANT and REVOKE (see the sketch after this list)
  • Tags & Classifications: Help label sensitive data (e.g., PII) and enforce compliance
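
A minimal sketch of the three-level namespace, a grant, and a column mask; all object and principal names are illustrative assumptions:

```python
# Hedged sketch: create catalog/schema/table, grant read access, and attach a
# column mask. All names are illustrative.
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales.customers (
        customer_id BIGINT, email STRING, region STRING
    )
""")
spark.sql("GRANT SELECT ON TABLE analytics.sales.customers TO `data-readers`")

# Column-level masking: only members of 'admins' see the raw email address
spark.sql("""
    CREATE OR REPLACE FUNCTION analytics.sales.email_mask(email STRING)
    RETURNS STRING
    RETURN CASE WHEN is_account_group_member('admins') THEN email ELSE '***REDACTED***' END
""")
spark.sql("""
    ALTER TABLE analytics.sales.customers
    ALTER COLUMN email SET MASK analytics.sales.email_mask
""")
```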

Why Use Unity Catalog?

  • Centralized governance across all workspaces and regions
  • Secure and fine-grained access at the table, column, row, and function levels
  • Auditability with built-in support for data lineage and access logs
  • Integration with Identity Providers like Azure AD, Okta, or SCIM
  • Support for Multiple Data Sources (Delta Lake, external tables, ML models, notebooks)

Security & Compliance Features:

  • Column-level masking (e.g., mask emails or credit card numbers)
  • Dynamic data access policies (based on user attributes)
  • Data classification and tagging (for PII, confidential info)
  • Logging and audit trails

Supported Assets:

  • Delta tables & views
  • Machine learning models
  • Functions and stored procedures
  • Files and folders in cloud storage (via Volume support)
  • Notebooks (metadata only)
  • Dashboards (metadata only)

Multi-Workspace & Cross-Cloud:

  • Works across multiple Databricks workspaces
  • Supports multi-cloud environments (AWS, Azure, GCP)
  • Ensures consistent policies and roles across all environments

Resources:

Databricks Documentation: Unity Catalog Overview

11 - Serverless Compute:

Databricks Serverless Compute is a fully managed compute option that automatically provisions and scales compute resources for SQL, notebooks, dashboards, and jobs—without requiring users to manage clusters. It simplifies infrastructure management, reduces cost through automatic scaling, and improves performance by using optimized hardware.

Key Features of Databricks Serverless Compute:

  • No cluster management: Users don’t need to create, configure, or manage clusters manually
  • Auto-scaling: Automatically scales resources up or down based on the workload
  • Fast startup: Compute resources are ready in seconds—ideal for ad hoc queries or interactive notebooks
  • Optimized performance: Databricks manages the compute infrastructure to ensure optimal performance, often using faster hardware and tuning
  • Pay-per-use pricing: You pay only for what you use, down to the second, making it cost-efficient

Use Cases:

  • Serverless SQL Warehouses: For analysts running BI tools or SQL queries
  • Notebook Jobs: For data scientists or engineers running ML or data processing pipelines
  • Dashboards: For live, interactive dashboards with fast load times
  • Ad hoc workloads: Great for unpredictable workloads or users who want to focus only on their work, not infrastructure
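
As a small sketch of the first use case, the databricks-sql-connector can run queries against a serverless SQL warehouse from Python; the hostname, HTTP path, and token are assumptions you would take from your warehouse's connection details:

```python
# Hedged sketch: query a serverless SQL warehouse with the Databricks SQL connector.
# Connection details are illustrative placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<your-workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT current_timestamp() AS now")
        print(cur.fetchall())
```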

Benefits:

  • Simplicity: No DevOps overhead; ideal for business analysts and data teams
  • Speed: Instant compute provisioning; reduces latency for short queries or jobs
  • Efficiency: More cost-effective due to auto-scaling and granular billing
  • Security: Integrates with Unity Catalog and platform-wide access controls

Security and Governance:

  • Integrated with Unity Catalog for fine-grained access control
  • Secure data access through private link (if configured)
  • Audit logs of compute and query execution

Resources:

 

We hope that you find this information useful in your transition from Databricks Community Edition to Free Edition and that you get to fully leverage the expanded capabilities offered by Databricks Free Edition.

 
