
SQL cell vs spark.sql in notebooks

CookDataSol
New Contributor

I am fairly new to Databricks, and indeed to Python, so apologies if this has been answered elsewhere, but I've been unable to find it.

I have been mainly working in notebooks as opposed to the SQL editor, but coding in SQL where possible using SQL cells or the %sql magic command. I have also occasionally used SQL within spark.sql() in Python cells. To my untrained eye these appear to be two ways of doing the same thing, so I'm wondering what the differences are, if any. I am guessing spark.sql gives greater flexibility to combine SQL with Python, but are there any other advantages or disadvantages of either method that I should be aware of?

Thank you

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @CookDataSol 

Great question; this trips up a lot of folks who are getting started with Databricks. In a notebook attached to an all-purpose cluster, both SQL cells (%sql) and spark.sql(...) are executed by the same Spark SQL engine against the notebook’s SparkSession, so the results are equivalent; the choice is mostly about ergonomics and how much Python you want to mix in.

Short answer

  • Use a dedicated SQL cell or the %sql magic when you’re writing mostly SQL, want rich result rendering, and plan to chain SQL queries in the notebook UI.

  • Use spark.sql(...) in a Python cell when you want the query’s result as a DataFrame you can immediately manipulate in Python, or you need to compose SQL with Python control flow and libraries.

Key differences and when it matters

  • Compute and engine

    • On an all-purpose cluster in a notebook, both SQL cells and spark.sql run on the same SparkSession for that notebook.

    • SQL cells can also target a SQL warehouse from a notebook, which is a different type of compute optimized for SQL workloads (and may have different configuration capabilities). Python cells cannot run on a SQL warehouse.

  • Ergonomics and results

    • SQL cells render a nice table UI and charts directly in the cell, and they automatically expose the result as an implicit DataFrame named _sqldf for use in later cells.

    • spark.sql(...) returns a regular DataFrame that you assign to a variable (df = ...), which is ideal when you want to chain DataFrame operations, apply UDFs, call display(df), write the result out, and so on (see the DataFrame example under Practical examples below). SQL syntax highlighting also works inside Python strings passed to spark.sql.

  • Language mixing and state

    • When you switch cell languages (e.g., from %python to %sql), the cells run in separate REPLs; variables defined in one language aren’t automatically available in the other. Share state via tables/views, files, or widgets (see the temporary view example under Practical examples below).

    • For parameters across languages in notebooks, widgets are the standard approach; SQL can safely consume widget values via parameter markers (for example, IDENTIFIER(:db)).

  • Parameters and variables in SQL

    • In notebooks, use widgets + parameter markers to pass values into SQL safely (for identifiers, wrap with IDENTIFIER()).

    • You can also use SQL’s SET to change session-level Spark SQL settings in SQL cells; spark.conf.set(...) is the Python analog (a short example appears under Practical examples below).

  • Parallel execution (handy for exploratory work)

    • SQL cells can run in parallel with currently running commands when the notebook is attached to an interactive cluster.

  • Limitations/nuances to remember

    • The implicit _sqldf variable from a SQL cell isn’t available if the notebook is running on a SQL warehouse, and there are version-specific limitations noted in the docs.

    • Databricks SQL warehouses allow only a limited, aliased set of Spark configurations (admins can set data access configs globally). This differs from notebooks on clusters where you can set many confs per SparkSession.

Practical examples

  • Equivalent query in a SQL cell vs Python

    SQL cell:

    -- in a SQL cell or %sql
    SELECT col1, COUNT(*) AS c
    FROM catalog.schema.table
    GROUP BY col1
    ORDER BY c DESC
     
     

    Python cell:

    df = spark.sql("""
      SELECT col1, COUNT(*) AS c
      FROM catalog.schema.table
      GROUP BY col1
      ORDER BY c DESC
    """)
    display(df)
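
    Because spark.sql returns an ordinary DataFrame, you can keep transforming the result in the same Python cell. A minimal sketch building on df above (the filter threshold and the commented-out target table are placeholders):

    # Python cell: continue working with df from the previous example
    from pyspark.sql import functions as F

    # Keep only the frequent values; the write target below is hypothetical
    top = df.filter(F.col("c") > 10).orderBy(F.desc("c"))
    display(top)
    # top.write.mode("overwrite").saveAsTable("catalog.schema.top_counts")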
     
     
  • Using _sqldf from a previous SQL cell

    -- SQL cell 1
    SELECT * FROM catalog.schema.table LIMIT 100
     
     
    # Python cell 2
    # _sqldf is the DataFrame produced by the last SQL cell
    display(_sqldf)
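
  • Sharing state across languages via a temporary view

    Variables don't cross the Python/SQL boundary, but a temporary view registered on the shared SparkSession does. A minimal sketch (the query and view name are placeholders):

    # Python cell: register a query result as a temp view
    events = spark.sql("SELECT * FROM catalog.schema.table WHERE col1 IS NOT NULL")
    events.createOrReplaceTempView("tmp_events")

    -- SQL cell: query the view created in Python
    SELECT COUNT(*) AS n FROM tmp_events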
     
     
  • Passing parameters with widgets (safe in SQL)

    # Python cell: create widgets
    dbutils.widgets.dropdown("database", "main", [d.name for d in spark.catalog.listDatabases()])
    dbutils.widgets.text("tbl", "my_table")
     
     
    -- SQL cell: safely use widget values
    SHOW TABLES IN IDENTIFIER(:database);
    
    SELECT *
    FROM IDENTIFIER(CONCAT(:database, '.', :tbl))
    LIMIT 10;
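
  • Session settings from SQL vs Python

    These two cells change the same session-level Spark SQL setting; the specific conf is just an illustration:

    -- SQL cell
    SET spark.sql.shuffle.partitions=8;

    # Python cell equivalent
    spark.conf.set("spark.sql.shuffle.partitions", "8")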
     
     

Bottom line

  • Prefer a SQL cell or %sql when your workflow is predominantly SQL and you want great inline tables/charts, the implicit _sqldf bridge to Python, parameterized widgets, and optionally the ability to run on a SQL warehouse.

  • Prefer spark.sql in Python when you need to weave SQL with Python logic, immediately manipulate results as a DataFrame, or integrate with Python libraries and DataFrame APIs.

     

Hope this helps, Louis.
