Hi @Kaz1,
I understand the frustration -- the error message is misleading because you ARE already using display(dataframe), which is the correct syntax. Let me explain what is actually happening and how to work around it.
WHAT IS HAPPENING
When you create a visualization (chart) from a display() output and then click "Aggregate over more data", Databricks attempts to perform backend aggregation. This is a server-side operation that re-runs your query to aggregate across the entire dataset, not just the first 10,000 rows shown in the results table.
The key thing to understand is that, for this backend aggregation feature, display() in a Python cell behaves differently from a SQL cell. When you run display(df) in Python, the notebook renders the first 10,000 rows (or 2 MB, whichever limit is hit first) as a table result. The initial chart you create from this works fine because it operates on those cached rows. However, when you click "Aggregate over more data," the system tries to re-execute the query with a server-side aggregation layer -- and this is where Python-generated DataFrames can hit limitations, because the backend aggregation engine may not be able to re-derive or re-execute the DataFrame lineage the way it can with a plain SQL query.
THE RECOMMENDED SOLUTION
The most reliable approach is to perform the aggregation yourself in PySpark or Spark SQL before calling display(). This way, you are not relying on the notebook visualization layer to do it for you.
For example, instead of displaying all rows and letting the chart aggregate:
# Instead of this (raw data, relying on chart aggregation):
display(df)
# Do this (aggregate first, then display):
from pyspark.sql import functions as F
aggregated_df = df.groupBy("category_column").agg(
    F.sum("value_column").alias("total_value"),
    F.count("*").alias("row_count"),
)
display(aggregated_df)
ALTERNATIVE: USE A SQL CELL
If you prefer to use the built-in chart aggregation (including "Aggregate over more data"), you can use a SQL cell instead. SQL cells have full support for backend aggregation in visualizations:
%sql
SELECT * FROM my_catalog.my_schema.my_table
If your data is in a DataFrame and not a table, you can create a temporary view first:
df.createOrReplaceTempView("my_temp_view")
Then in a SQL cell:
%sql
SELECT * FROM my_temp_view
The chart created from this SQL cell will fully support the "Aggregate over more data" feature.
OTHER VISUALIZATION ALTERNATIVES
You can also bypass the built-in display() visualization entirely and use Python visualization libraries directly:
import plotly.express as px
# toPandas() collects the full DataFrame to the driver, so only do this when
# the data (or a pre-aggregated version of it) fits in driver memory
pdf = df.toPandas()
fig = px.bar(pdf, x="category_column", y="value_column")
fig.show()
Or with Matplotlib:
import matplotlib.pyplot as plt
pdf = df.toPandas()
pdf.groupby("category_column")["value_column"].sum().plot(kind="bar")
plt.show()
These libraries give you full control over aggregation and rendering.
QUICK REFERENCE
- display(df) in Python cell: Limited "Aggregate over more data" support -- may error on some DataFrames (the issue you are hitting)
- SQL cell result: Fully supports "Aggregate over more data" -- best for built-in chart aggregation
- Pre-aggregate in PySpark, then display(): Most reliable approach -- you control the aggregation
- Plotly / Matplotlib: Most flexible -- you control both aggregation and rendering
DOCUMENTATION REFERENCES
- Databricks Visualizations: https://docs.databricks.com/en/visualizations/index.html
- Visualization Types: https://docs.databricks.com/en/visualizations/visualization-types.html
- Notebook Results and Limitations: https://docs.databricks.com/en/notebooks/notebook-limitations.html
Hope this helps! Let me know if you have follow-up questions.
* This reply was drafted with an agent system I built, which researches and drafts responses based on the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.