
How to force pandas_on_spark plots to use all dataframe data?

DavideCagnoni — Fri, 11 Feb 2022 09:09:31 GMT

When I load a table as a `pandas_on_spark` dataframe, and try to e.g. scatterplot two columns, what I obtain is a subset of the desired points.

For example, if I try to plot two columns from a table with 1,000,000 rows, I only see some of the data. It looks like the first 1000 rows, but I may be biased by the behavior of the `display` function on Spark dataframes, which states that it shows only the first 1000 rows when the table has more.

Is it possible either to force the plot to show all the data, or at least to know how much of the total data is being plotted?

Re: How to force pandas_on_spark plots to use all dataframe data?

Anonymous — Fri, 11 Feb 2022 16:22:34 GMT

Hello, @Davide Cagnoni - It's nice to meet you! My name is Piper, and I'm a moderator for the community. Thank you for bringing this question to us. Let's give your peers a chance to respond and we'll come back if we need to.

Re: How to force pandas_on_spark plots to use all dataframe data?

DavideCagnoni — Mon, 21 Feb 2022 15:57:24 GMT

@Kaniz Fatma I need to use plotly in order to be able to interact with the graph (zoom in etc.) so this doesn't solve my problem...

Re: How to force pandas_on_spark plots to use all dataframe data?

User16255483290 — Wed, 02 Mar 2022 15:24:49 GMT

@Davide Cagnoni

It's a limitation of Databricks notebooks: they can't interact with graphs.

Re: How to force pandas_on_spark plots to use all dataframe data?

DavideCagnoni — Thu, 03 Mar 2022 08:07:53 GMT

@Kaniz Fatma The problem is not about performance or plotly. It is about the pandas_on_spark dataframe arbitrarily subsampling the input data when plotting, without notifying the user about it.

While subsampling is understandable and maybe even necessary sometimes, a notification like the one shown when you `display(table)` would at least be useful.
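
For anyone landing on this thread: the pandas API on Spark exposes two plotting options, `plotting.max_rows` (default 1000) and `plotting.sample_ratio` (unset by default), that appear to control this truncation. The sketch below is not an official answer. The helper names `rows_plotted`, `sampled_fraction` and `scatter_all_rows` are made up for illustration, and the assumption that scatter plots follow the "top-n" rule (first `plotting.max_rows` rows) rather than the sampling rule should be checked against your PySpark version.

```python
def rows_plotted(total_rows, max_rows=1000):
    """Rows drawn by a top-n plot kind (assumed here to include scatter):
    the first `max_rows` rows, or all rows if the table is smaller."""
    return min(total_rows, max_rows)


def sampled_fraction(total_rows, max_rows=1000, sample_ratio=None):
    """Fraction of rows drawn by a sample-based plot kind (e.g. line, area).
    When `plotting.sample_ratio` is unset, the documented rule derives it
    from the ratio of `plotting.max_rows` to the total data size."""
    if sample_ratio is not None:
        return sample_ratio
    if total_rows == 0:
        return 1.0
    return min(1.0, max_rows / total_rows)


def scatter_all_rows(psdf, x, y):
    """Hypothetical helper: temporarily raise `plotting.max_rows` to the
    table size so the scatter plot is built from every row.
    Caution: the plotted rows are collected to the driver."""
    import pyspark.pandas as ps  # imported lazily; requires a Spark environment

    n = max(len(psdf), 1)
    with ps.option_context("plotting.max_rows", n):
        return psdf.plot.scatter(x=x, y=y)


if __name__ == "__main__":
    total = 1_000_000
    print(f"top-n plot shows {rows_plotted(total)} of {total} rows")
    print(f"sampled plot shows about {sampled_fraction(total):.4%} of the rows")
```

Calling `ps.set_option("plotting.max_rows", n)` once at the top of the notebook should have the same effect globally. Since every plotted point ends up in driver memory (and in the plotly figure), plotting a million rows may be slow, so raising the limit step by step is probably safer than jumping straight to the full table.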