Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to force pandas_on_spark plots to use all dataframe data?

DavideCagnoni
Contributor

When I load a table as a `pandas_on_spark` dataframe, and try to e.g. scatterplot two columns, what I obtain is a subset of the desired points.

For example, if I plot two columns from a table with 1,000,000 rows, I only see some of the data. It looks like the first 1000 rows, but I may be biased by the behavior of the `display` function on Spark dataframes, which states that it shows only the first 1000 rows when the table has more.

Is it possible either to force the plot to show all the data, or at least to find out how much of the total data is being plotted?
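Note for readers landing on this question: in the open-source pandas API on Spark (`pyspark.pandas`), scatter plots are, as far as I know, drawn from only the first `plotting.max_rows` rows (default 1000), which would match the behavior described above. The following is a minimal sketch under that assumption, not a confirmed Databricks recommendation. `plotted_rows` and `raise_plot_limit` are hypothetical helper names, and the `ps` argument is assumed to be the imported `pyspark.pandas` module.

```python
def plotted_rows(total_rows, max_rows):
    """Return (rows_drawn, fraction_drawn) for a plot limited to the first max_rows rows."""
    shown = min(total_rows, max_rows)
    fraction = shown / total_rows if total_rows else 1.0
    return shown, fraction


def raise_plot_limit(ps, total_rows):
    """Raise plotting.max_rows to total_rows if needed; return the previous limit.

    `ps` is assumed to be the imported pyspark.pandas module.
    """
    old_limit = ps.get_option("plotting.max_rows")
    if total_rows > old_limit:
        ps.set_option("plotting.max_rows", total_rows)
    return old_limit


# Usage in a notebook (assumes a pandas-on-Spark DataFrame named `psdf`):
#   import pyspark.pandas as ps
#   n = len(psdf)
#   print(plotted_rows(n, ps.get_option("plotting.max_rows")))
#   raise_plot_limit(ps, n)
#   psdf.plot.scatter(x="col_name_1", y="col_name_2")
```

Raising the limit to the full row count pulls every plotted row onto the driver, so with a million points the plot may become slow or run out of memory.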

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz_Fatma
Community Manager

Hi @Davide Cagnoni​ , The Ideas Portal lets you influence the Databricks product roadmap by providing feedback directly to the product team.

Use the Ideas Portal to:

  • Enter feature requests.
  • View, comment, and vote up other users’ requests.
  • Monitor the progress of your favorite ideas as the Databricks product team goes through their product planning and development process.

For a quick tutorial on submitting an idea, watch this video:


8 REPLIES

Anonymous
Not applicable

Hello, @Davide Cagnoni​ - It's nice to meet you! My name is Piper, and I'm a moderator for the community. Thank you for bringing this question to us. Let's give your peers a chance to respond and we'll come back if we need to.

Kaniz_Fatma
Community Manager

Hi @Davide Cagnoni​ , You can use matplotlib directly:

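import matplotlib.pyplot as plt

# Collect the two columns to the driver as a plain pandas DataFrame before plotting
pdf = df[['col_name_1', 'col_name_2']].to_pandas()
plt.scatter(pdf['col_name_1'], pdf['col_name_2'])
plt.show()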

DavideCagnoni
Contributor

@Kaniz Fatma​ I need to use plotly in order to be able to interact with the graph (zoom in etc.) so this doesn't solve my problem...

User16255483290
Contributor

@Davide Cagnoni​ 

It's a limitation of Databricks notebooks: they can't interact with graphs.

Kaniz_Fatma
Community Manager

Hi @Davide Cagnoni​ ,

Note:

Inside Databricks notebooks we recommend using Plotly Offline. Plotly Offline may not perform well when handling large datasets. If you notice performance issues, you should reduce the size of your dataset.
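One way to reduce the dataset in a controlled way, rather than relying on an implicit row limit, is to sample a known fraction before plotting. This is a minimal sketch, assuming a pandas-on-Spark DataFrame named `psdf`. `sample_fraction` is a hypothetical helper, and the 50,000-point target is an arbitrary choice for illustration.

```python
def sample_fraction(total_rows, target_points):
    """Fraction of rows to keep so that roughly target_points rows remain."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, target_points / total_rows)


# Usage in a notebook (assumes a pandas-on-Spark DataFrame named `psdf`):
#   import plotly.express as px
#   frac = sample_fraction(len(psdf), 50_000)
#   pdf = psdf.sample(frac=frac, random_state=0).to_pandas()
#   px.scatter(pdf, x="col_name_1", y="col_name_2").show()
```

Because the fraction is computed explicitly, you know how much of the table ends up in the plot and can report it alongside the figure.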

Source

DavideCagnoni
Contributor

@Kaniz Fatma​  The problem is not about performance or plotly. It is about the pandas_on_spark dataframe arbitrarily subsampling the input data when plotting, without notifying the user about it.

While subsampling is understandable and sometimes even necessary, a notification like the one shown when you `display(table)` would at least be useful.

Hi @Davide Cagnoni​ , Would you like to share this feedback on our Ideas Portal?
