
SQL cell vs spark.sql in notebooks

CookDataSol
New Contributor

I am fairly new to Databricks, and indeed to Python, so apologies if this has been answered elsewhere, but I've been unable to find it.

I have been mainly working in notebooks as opposed to the SQL editor, but coding in SQL where possible using SQL cells or the %sql magic command. I have also occasionally used SQL within spark.sql() in Python cells. To my untrained eye these appear to be two ways of doing the same thing, so I'm wondering what the differences are, if any. I am guessing spark.sql gives greater flexibility to combine SQL with Python, but are there any other advantages or disadvantages of either method that I should be aware of?

Thank you

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @CookDataSol 

Great question; this trips up a lot of folks who are getting started with Databricks. In a notebook attached to an all-purpose cluster, both SQL cells (%sql) and spark.sql(...) are executed by the same Spark SQL engine against the notebook’s SparkSession, so the results are equivalent; the choice is mostly about ergonomics and how much Python you want to mix in.

Short answer

  • Use a dedicated SQL cell or the %sql magic when you’re writing mostly SQL, want rich result rendering, and plan to chain SQL queries in the notebook UI.

  • Use spark.sql(...) in a Python cell when you want the query’s result as a DataFrame you can immediately manipulate in Python, or you need to compose SQL with Python control flow and libraries.

Key differences and when it matters

  • Compute and engine

    • On an all-purpose cluster in a notebook, both SQL cells and spark.sql run on the same SparkSession for that notebook.

    • SQL cells can also target a SQL warehouse from a notebook, which is a different type of compute optimized for SQL workloads (and may have different configuration capabilities). Python cells cannot run on a SQL warehouse.

  • Ergonomics and results

    • SQL cells render a nice table UI and charts directly in the cell, and they automatically expose the result as an implicit DataFrame named _sqldf for use in later cells.

    • spark.sql(...) returns a regular DataFrame that you assign to a variable (df = ...), which is ideal when you want to chain DataFrame operations, apply UDFs, call display(df), write the result out, and so on (see the DataFrame example under Practical examples below). SQL syntax highlighting also works inside Python strings passed to spark.sql.

  • Language mixing and state

    • When you switch cell languages (e.g., from %python to %sql), the cells run in separate REPLs; variables defined in one language aren’t automatically available in the other. Share state via tables/views, files, or widgets (see the temporary view example under Practical examples below).

    • For parameters across languages in notebooks, widgets are the standard approach; SQL can safely consume widget values via parameter markers (for example, IDENTIFIER(:db)).

  • Parameters and variables in SQL

    • In notebooks, use widgets + parameter markers to pass values into SQL safely (for identifiers, wrap with IDENTIFIER()).

    • You can also use SQL’s SET to change session-level Spark SQL settings in SQL cells; spark.conf.set(...) is the Python analog (a short example appears under Practical examples below).

  • Parallel execution (handy for exploratory work)

    • SQL cells can run in parallel with currently running commands when the notebook is attached to an interactive cluster.

  • Limitations/nuances to remember

    • The implicit _sqldf variable from a SQL cell isn’t available if the notebook is running on a SQL warehouse, and there are version-specific limitations noted in the docs.

    • Databricks SQL warehouses allow only a limited, aliased set of Spark configurations (admins can set data access configs globally). This differs from notebooks on clusters where you can set many confs per SparkSession.

Practical examples

  • Equivalent query in a SQL cell vs Python

    SQL cell:

    -- in a SQL cell or %sql
    SELECT col1, COUNT(*) AS c
    FROM catalog.schema.table
    GROUP BY col1
    ORDER BY c DESC
     
     

    Python cell:

    df = spark.sql("""
      SELECT col1, COUNT(*) AS c
      FROM catalog.schema.table
      GROUP BY col1
      ORDER BY c DESC
    """)
    display(df)
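
    Because spark.sql returns an ordinary DataFrame, you can keep transforming the result in the same Python cell. A minimal sketch building on df above (the filter threshold and the commented-out target table are placeholders):

    # Python cell: continue working with df from the previous example
    from pyspark.sql import functions as F

    # Keep only the frequent values; the write target below is hypothetical
    top = df.filter(F.col("c") > 10).orderBy(F.desc("c"))
    display(top)
    # top.write.mode("overwrite").saveAsTable("catalog.schema.top_counts")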
     
     
  • Using _sqldf from a previous SQL cell

    -- SQL cell 1
    SELECT * FROM catalog.schema.table LIMIT 100
     
     
    # Python cell 2
    # _sqldf is the DataFrame produced by the last SQL cell
    display(_sqldf)
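
  • Sharing state across languages via a temporary view

    Variables don't cross the Python/SQL boundary, but a temporary view registered on the shared SparkSession does. A minimal sketch (the query and view name are placeholders):

    # Python cell: register a query result as a temp view
    events = spark.sql("SELECT * FROM catalog.schema.table WHERE col1 IS NOT NULL")
    events.createOrReplaceTempView("tmp_events")

    -- SQL cell: query the view created in Python
    SELECT COUNT(*) AS n FROM tmp_events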
     
     
  • Passing parameters with widgets (safe in SQL)

    # Python cell: create widgets
    dbutils.widgets.dropdown("database", "main", [d.name for d in spark.catalog.listDatabases()])
    dbutils.widgets.text("tbl", "my_table")
     
     
    -- SQL cell: safely use widget values
    SHOW TABLES IN IDENTIFIER(:database);
    
    SELECT *
    FROM IDENTIFIER(CONCAT(:database, '.', :tbl))
    LIMIT 10;
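
  • Session settings from SQL vs Python

    These two cells change the same session-level Spark SQL setting; the specific conf is just an illustration:

    -- SQL cell
    SET spark.sql.shuffle.partitions=8;

    # Python cell equivalent
    spark.conf.set("spark.sql.shuffle.partitions", "8")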
     
     

Bottom line

  • Prefer a SQL cell or %sql when your workflow is predominantly SQL and you want great inline tables/charts, the implicit _sqldf bridge to Python, parameterized widgets, and optionally the ability to run on a SQL warehouse.

  • Prefer spark.sql in Python when you need to weave SQL with Python logic, immediately manipulate results as a DataFrame, or integrate with Python libraries and DataFrame APIs.

     

Hope this helps, Louis.
