02-11-2022 01:09 AM
When I load a table as a `pandas_on_spark` dataframe and try to, e.g., scatter-plot two columns, I get only a subset of the expected points.
For example, if I try to plot two columns from a table with 1000000 rows, I only see some of the data. It looks like the first 1000 rows, but maybe I am swayed by the behavior of the `display` function on Spark dataframes, which states that it uses only the first 1000 rows if the table has more.
Is it possible either to force the plot to show all the data, or at least to know how much of the total data is being plotted?
04-11-2022 11:43 PM
Hi @Davide Cagnoni, The Ideas Portal lets you influence the Databricks product roadmap by providing feedback directly to the product team.
Use the Ideas Portal to:
For a quick tutorial on submitting an idea, watch this video:
02-11-2022 08:22 AM
Hello, @Davide Cagnoni - It's nice to meet you! My name is Piper, and I'm a moderator for the community. Thank you for bringing this question to us. Let's give your peers a chance to respond and we'll come back if we need to.
02-21-2022 07:12 AM
Hi @Davide Cagnoni, You can use matplotlib directly:
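import matplotlib.pyplot as plt
# to_numpy() collects each column to the driver, so this only suits data that fits in memory
plt.scatter(df['col_name_1'].to_numpy(), df['col_name_2'].to_numpy())
plt.show()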
02-21-2022 07:57 AM
@Kaniz Fatma I need to use plotly in order to be able to interact with the graph (zoom in etc.) so this doesn't solve my problem...
03-02-2022 07:24 AM
@Davide Cagnoni
It's a limitation of Databricks notebooks: they can't interact with graphs.
03-02-2022 10:47 PM
Hi @Davide Cagnoni,
Note:
Inside Databricks notebooks we recommend using Plotly Offline. Plotly Offline may not perform well when handling large datasets. If you notice performance issues, you should reduce the size of your dataset.
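One way to reduce the size of the dataset before plotting is to draw a random sample sized to a target number of points. The following is a minimal sketch, assuming `psdf` is a pandas-on-Spark DataFrame; `sample_fraction` and `downsample_for_plot` are hypothetical helper names (not part of any library), and the 50,000-point target is an arbitrary placeholder.

```python
def sample_fraction(total_rows: int, target_points: int) -> float:
    """Fraction of rows to keep so that roughly target_points remain."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, target_points / total_rows)


def downsample_for_plot(psdf, target_points: int = 50_000, seed: int = 0):
    """Hypothetical helper: randomly sample psdf down to about target_points rows."""
    frac = sample_fraction(len(psdf), target_points)  # len() triggers a Spark count
    if frac >= 1.0:
        return psdf  # already small enough, nothing to do
    # sample() on pandas-on-Spark takes a fraction, so the resulting row count is approximate
    return psdf.sample(frac=frac, random_state=seed)
```

The sampled frame is small enough to convert with `to_pandas()` and hand to the plotting library of your choice. Because the sample is random, sparse regions of the data may be under-represented in the plot.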
03-03-2022 12:07 AM
@Kaniz Fatma The problem is not about performance or plotly. It is about the pandas_on_spark dataframe arbitrarily subsampling the input data when plotting, without notifying the user about it.
While subsampling is understandable and maybe even necessary sometimes, at least a notification like the one shown when you `display(table)` would be useful.
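For context, the pandas API on Spark has a `plotting.max_rows` option (default 1000) that caps top-n based plots. That scatter plots fall into this group is an assumption here and may depend on the Spark version. The sketch below reports how much data would be drawn under that cap and then lifts it temporarily; `rows_plotted`, `truncation_notice` and `scatter_all_rows` are hypothetical helper names, not library functions.

```python
def rows_plotted(total_rows: int, max_rows: int = 1000) -> int:
    """Rows a top-n based plot would draw, assuming the cap is plotting.max_rows."""
    return min(total_rows, max_rows)


def truncation_notice(total_rows: int, max_rows: int = 1000) -> str:
    """Human-readable summary of how much of the data ends up in the plot."""
    shown = rows_plotted(total_rows, max_rows)
    share = shown / total_rows if total_rows else 1.0
    return f"plotting {shown} of {total_rows} rows ({share:.2%})"


def scatter_all_rows(psdf, x, y):
    """Hypothetical helper: scatter-plot every row of a pandas-on-Spark DataFrame."""
    import pyspark.pandas as ps  # imported lazily so the helpers above work without Spark

    total = len(psdf)  # triggers a Spark count
    # Report what the current setting would have drawn
    print(truncation_notice(total, ps.get_option("plotting.max_rows")))
    # Raise the cap only for this call; the previous value is restored on exit
    with ps.option_context("plotting.max_rows", total):
        return psdf.plot.scatter(x=x, y=y)
```

With a 1000000-row table and the default setting, `truncation_notice(1_000_000)` returns `"plotting 1000 of 1000000 rows (0.10%)"`. Lifting the cap brings every plotted row to the driver, so for very large tables this can be slow or exhaust memory.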
03-30-2022 03:05 AM
Hi @Davide Cagnoni, Would you like to share this feedback on our Ideas Portal?