Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

dataframe.display() doesn't support data aggregation

Kaz1
New Contributor

Error: 

dataframe.display() doesn't support data aggregation. Use display(dataframe) for better results in Databricks notebooks.

But I don't use dataframe.display()! I use display(dataframe). This error occurs when creating a visualization in a Databricks notebook. The first visualization renders fine, but clicking "Aggregate over more data" triggers this error.
3 REPLIES

AnthonyAnand
New Contributor III

@Kaz1 The likely reason is that display(dataframe) behaves differently depending on whether it is showing a simple table or a visualization with server-side aggregation. When you click "Aggregate over more data," Databricks tries to re-run the underlying query through a specialized aggregation layer over the full dataset.

If you need to aggregate over the entire dataset, the most robust way to avoid UI errors is to let Spark handle the aggregation (in PySpark or SQL) before calling display() to build the visualization.




balajij8
Contributor

You can use Plotly or Matplotlib and skip the built-in display aggregation entirely.

SteveOstrowski
Databricks Employee

Hi @Kaz1,

I understand the frustration -- the error message is misleading because you ARE already using display(dataframe), which is the correct syntax. Let me explain what is actually happening and how to work around it.


WHAT IS HAPPENING

When you create a visualization (chart) from a display() output and then click "Aggregate over more data", Databricks attempts to perform backend aggregation. This is a server-side operation that re-runs your query to aggregate across the entire dataset, not just the first 10,000 rows shown in the results table.

The key thing to understand is that display() in a Python cell works differently from a SQL cell when it comes to this backend aggregation feature. When you run display(df) in Python, the notebook renders the first 10,000 rows (or 2 MB, whichever is lower) as a table result. The initial chart visualization you create from this works fine because it operates on those cached rows.

However, when you click "Aggregate over more data," the system tries to re-execute the query with a server-side aggregation layer -- and this is where Python-generated DataFrames can hit limitations, because the backend aggregation engine may not be able to re-derive or re-execute the DataFrame lineage the same way it can with a SQL query.


THE RECOMMENDED SOLUTION

The most reliable approach is to perform the aggregation yourself in PySpark or Spark SQL before calling display(). This way, you are not relying on the notebook visualization layer to do it for you.

For example, instead of displaying all rows and letting the chart aggregate:

# Instead of this (raw data, relying on chart aggregation):
display(df)

# Do this (aggregate first, then display):
from pyspark.sql import functions as F

aggregated_df = df.groupBy("category_column").agg(
    F.sum("value_column").alias("total_value"),
    F.count("*").alias("row_count"),
)
display(aggregated_df)


ALTERNATIVE: USE A SQL CELL

If you prefer to use the built-in chart aggregation (including "Aggregate over more data"), you can use a SQL cell instead. SQL cells have full support for backend aggregation in visualizations:

%sql
SELECT * FROM my_catalog.my_schema.my_table

If your data is in a DataFrame and not a table, you can create a temporary view first:

df.createOrReplaceTempView("my_temp_view")

Then in a SQL cell:

%sql
SELECT * FROM my_temp_view

The chart created from this SQL cell will fully support the "Aggregate over more data" feature.


OTHER VISUALIZATION ALTERNATIVES

You can also bypass the built-in display() visualization entirely and use Python visualization libraries directly:

import plotly.express as px

# Convert to Pandas (for smaller datasets)
pdf = df.toPandas()
fig = px.bar(pdf, x="category_column", y="value_column")
fig.show()

Or with Matplotlib:

import matplotlib.pyplot as plt

pdf = df.toPandas()
pdf.groupby("category_column")["value_column"].sum().plot(kind="bar")
plt.show()

These libraries give you full control over aggregation and rendering.


QUICK REFERENCE

- display(df) in Python cell: Limited "Aggregate over more data" support -- may error on some DataFrames (the issue you are hitting)
- SQL cell result: Fully supports "Aggregate over more data" -- best for built-in chart aggregation
- Pre-aggregate in PySpark, then display(): Most reliable approach -- you control the aggregation
- Plotly / Matplotlib: Most flexible -- you control both aggregation and rendering


DOCUMENTATION REFERENCES

- Databricks Visualizations: https://docs.databricks.com/en/visualizations/index.html
- Visualization Types: https://docs.databricks.com/en/visualizations/visualization-types.html
- Notebook Results and Limitations: https://docs.databricks.com/en/notebooks/notebook-limitations.html

Hope this helps! Let me know if you have follow-up questions.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.