topic Re: Displaying Pandas Dataframe in Data Engineering

Displaying Pandas Dataframe

sdaza — Wed, 30 May 2018 03:13:21 GMT

I had this issue when displaying pandas data frames. Any ideas on how to display a pandas dataframe?

display(mydataframe)
Exception: Cannot call display(<class 'pandas.core.frame.DataFrame'>)

Re: Displaying Pandas Dataframe

ricardo_portill — Wed, 30 May 2018 04:53:06 GMT

Hi @sdaza,

The display command can be used to visualize Spark data frames or image objects, but not a pandas data frame. If you'd like to visualize your pandas data, I recommend using matplotlib to turn the data into a figure. The code below shows how this would work; remember to import matplotlib using the 'New Library' functionality.

import numpy as np
import pandas as pd

## create dummy data for visualization (a NumPy array standing in for pandas data)
my_data = np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]])

import matplotlib.pyplot as plt

x = my_data[:,0]
y = my_data[:,1]

## create a figure and axes, then plot the data points
fig, ax = plt.subplots()
ax.plot(x, y)

## Display the matplotlib Figure object to render our pandas data frame
display(fig)

Re: Displaying Pandas Dataframe

ricardo_portill — Thu, 31 May 2018 13:43:27 GMT

Hi @sdaza,

You can use the display command to display objects such as a matplotlib figure or Spark data frames, but not a pandas data frame. Below is code to do this using matplotlib. Within Databricks, you can also import your own visualization library and display images using that library's native commands (bokeh or ggplot, for example). See an example here: https://docs.databricks.com/user-guide/visualizations/bokeh.html

import numpy as np
import pandas as pd

## create dummy data for visualization (a NumPy array standing in for pandas data)
my_data = np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]])

import matplotlib.pyplot as plt

x = my_data[:,0]
y = my_data[:,1]

## create a figure and axes, then plot the data points
fig, ax = plt.subplots()
ax.plot(x, y)

## Display the matplotlib Figure object to render our pandas data frame
display(fig)

Re: Displaying Pandas Dataframe

sdaza — Sat, 02 Jun 2018 14:41:18 GMT

It is so unfortunate that we cannot render pandas dataframes. In R you can display dataframes and datatables using only display(mydata).

Is there an easy way to transform pandas dataframes to pyspark dataframes?

I ran your example but I cannot plot anything in databricks using python.

I get this error. I am using pandas 0.23.0 and matplotlib 2.2.2.

/databricks/python/local/lib/python2.7/site-packages/ggplot/components/smoothers.py:4: FutureWarning: The pandas.lib module is deprecated and will be removed in a future version. These are private functions and can be accessed from pandas._libs.lib instead
  from pandas.lib import Timestamp

ImportError: cannot import name Timestamp
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<command-3750589184628757> in <module>()
     11 
     12 ## Display the matplotlib Figure object to render our pandas data frame
---> 13 display(fig)

/tmp/1527806120745-0/PythonShell.py in display(self, input, *args, **kwargs)
    708 " Call help(display) for more info."
    709 # import ggplot is too slow, so delay it until first call of display()
--> 710 import ggplot
    711 if input is None:
    712 curFigure = mpl.pyplot.gcf()

/databricks/python/local/lib/python2.7/site-packages/ggplot/__init__.py in <module>()
     19 __version__ = '0.6.8'
     20 
---> 21 from .qplot import qplot
     22 from .ggplot import ggplot
     23 from .components import aes

Re: Displaying Pandas Dataframe

ricardo_portill — Sat, 02 Jun 2018 19:36:53 GMT

@sdaza,

You can go from a Spark Data frame to pandas and visualize with matplotlib or from pandas to Spark data frame (separate block) using the methods below. The syntax for the pandas plot is very similar to display(<data frame>) once the plot is defined.

As for the error above, it looks like Timestamp can no longer be imported from pandas.lib as of pandas 0.23. Are you pinning a specific version of pandas for use within Databricks? This should work with pandas 0.19.2, for example, which I believe is the standard version.

## Pandas to Spark data frame and using native plot

import numpy as np
import pandas as pd
from pyspark.sql.types import *

mySchema = StructType([StructField("lat", LongType(), True),
                       StructField("long", LongType(), True)])

pdDF = pd.DataFrame(np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]]))

display(sqlContext.createDataFrame(pdDF, mySchema))

## Spark to Pandas and plot with matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pdDF = sc.parallelize([("bob", 100), ("sally", 200)]).toDF(["firstName", "salary"]).toPandas()

plt.clf()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()

Re: Displaying Pandas Dataframe

sdaza — Sat, 02 Jun 2018 20:56:24 GMT

yes, I installed pandas 0.23 because I was using this feature:

qcut

duplicates : {default ‘raise’, ‘drop’}, optional

If bin edges are not unique, raise ValueError or drop non-uniques.

New in version 0.20.0.
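
For readers unfamiliar with the parameter quoted above, here is a minimal sketch of what duplicates='drop' changes. The data is made up for illustration; the repeated zeros are chosen only so that several quantile edges coincide.

```python
import pandas as pd

# Made-up, skewed data: the repeated zeros make several quantile edges identical.
values = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 10])

# With the default duplicates='raise', qcut rejects non-unique bin edges.
try:
    pd.qcut(values, q=4)
    raised = False
except ValueError:
    raised = True

# duplicates='drop' discards the repeated edges, so fewer than q bins come back.
binned = pd.qcut(values, q=4, duplicates="drop")
print(raised, list(binned.cat.categories))
```

The trade-off is that the result may have fewer bins than requested, so downstream code should not assume exactly q categories.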

Re: Displaying Pandas Dataframe

ricardo_portill — Fri, 08 Jun 2018 14:13:20 GMT

@sdaza, this is a compatibility issue between ggplot and pandas 0.23. As a workaround, you can manually change the failing import in ggplot as shown below.

# TIMESTAMP IMPORT MUST BE CHANGED WHEN USING PANDAS 0.23.0+
#from pandas.lib import Timestamp
# CHANGE TO THE FOLLOWING
from pandas._libs.tslibs.timestamps import Timestamp
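
An alternative sketch for the same patch, in case the private module path moves again: Timestamp is also exposed as a public top-level name in pandas, so the import can try that first and fall back to the private path above. Whether the rest of ggplot 0.6.8 then works with pandas 0.23 is an assumption that has not been checked here.

```python
# Prefer the public top-level name; fall back to the private module path
# from the workaround above only if the public import is unavailable.
try:
    from pandas import Timestamp
except ImportError:
    from pandas._libs.tslibs.timestamps import Timestamp

# Quick check that the imported class behaves as expected.
ts = Timestamp("2018-06-08 14:13:20")
print(ts.year, ts.month, ts.day)
```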

Re: Displaying Pandas Dataframe

bscannell1_5588 — Thu, 18 Oct 2018 14:35:39 GMT

This generally seems to work for me:

display(spark.createDataFrame(df))

It would certainly be nice if this could happen automatically underneath the hood though.

Re: Displaying Pandas Dataframe

ThomasKastl — Wed, 19 Dec 2018 12:41:23 GMT

This also has a lot of overhead: it creates a Spark dataframe, distributing the data just to pull it back for display. I really don't understand why Databricks does not simply allow plotting pandas dataframes locally by calling display().

Re: Displaying Pandas Dataframe

AndrewSprague — Mon, 04 Nov 2019 17:23:00 GMT

If you only want to view the dataframe contents as a table, put either of the following on the last line of a cell:

mydataframe

mydataframe.head()

Re: Displaying Pandas Dataframe

abhishekKumar — Mon, 08 Mar 2021 13:05:59 GMT

display(mydataframe.astype(str))
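
The post above gives no explanation. One plausible reading, offered as an assumption rather than documented behavior, is that casting every column to strings hands display a frame with uniform column types. The sketch below only shows what astype(str) does to a small made-up frame; the column names are hypothetical and display itself is not called, since it exists only inside Databricks.

```python
import pandas as pd

# Hypothetical frame with an integer column, a mixed-type column and a datetime column.
df = pd.DataFrame({
    "id": [1, 2],
    "mixed": [1.5, "n/a"],
    "when": pd.to_datetime(["2021-03-08", "2021-03-09"]),
})

# astype(str) converts every value to its string form.
as_text = df.astype(str)
print(as_text["id"].tolist(), as_text["mixed"].tolist())
```

Note that the cast also discards numeric and datetime types, so sorting or plotting on the displayed result treats every column as text.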

Re: Displaying Pandas Dataframe

Tim_Green — Tue, 07 Jun 2022 21:13:21 GMT

A simple way to get a nicely formatted table from a pandas dataframe:

displayHTML(df.to_html())

to_html takes parameters that control the output. If you want something less basic, try this code I wrote, which adds scrolling and some control over the column widths (including index columns, unlike to_html). YMMV, and this might stop working if pandas changes the output of to_html.

def display_pd(df, height=300, column_widths=None, column_units='px'):
  """
  Display pandas dataframe in databricks
  @param df: the Pandas dataframe to display
  @param height: the height in pixels of the table to display
  @param column_widths: specify individual column widths as a list. If not specified, the columns are sized proportionate to the maximum length of data in each column
    Can be shorter than the total number of columns, the remaining columns will take up the remaining space proportionately
    To specify mixed CSS units, pass a list of string widths with the CSS units included, and set column_units to the empty string
  @param column_units: the CSS units of the widths passed in
  """
  import pandas as pd
 
  if not column_widths:
    # proportional widths
    import numpy as np
    len_v = np.vectorize(len)
    lengths = len_v(df.reset_index().values.astype(str)).max(axis=0)
    total = np.sum(lengths)
    column_widths = np.trunc(lengths*100/total)
    column_units='%'
  
  widths = []
  for i, c in enumerate(column_widths):
    widths.append(f'''.display_pd_table thead th:nth-child({i+1}) {{width: {c}{column_units};}}''')
    
  html = f'''
<style>
.display_pd_container {{height: {height}px; width: 100%; overflow: auto;}}
.display_pd_table {{position: sticky; top: 0; width: 100%;}}
.display_pd_table td {{overflow: hidden;}}
.display_pd_table th {{overflow: hidden; vertical-align: top;}}
{chr(10).join(widths)}
.display_pd_table thead {{position: -webkit-sticky; position: sticky; top: 0px; z-index: 100; background-color: rgb(255, 255, 255);}}
</style> 
<div class="display_pd_container">{df.to_html(classes='display_pd_table')}</div>'''
  
  displayHTML(html)

then simply call:

display_pd(df)
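
For the simpler displayHTML(df.to_html()) route, here is a short sketch of a few to_html parameters that control the output. The frame and the class name my_table are made up for illustration. The result is a plain HTML string, which on Databricks would be passed to displayHTML.

```python
import pandas as pd

# Small made-up frame for illustration.
df = pd.DataFrame({"name": ["bob", "sally"], "salary": [100.0, 200.5]})

html = df.to_html(
    index=False,                   # leave out the index column
    max_rows=50,                   # truncate long frames instead of rendering every row
    float_format="{:.1f}".format,  # callable applied to each float value
    na_rep="",                     # render missing values as empty cells
    border=0,                      # value of the table's border attribute
    classes="my_table",            # hypothetical CSS class name added to the <table> tag
)

# On Databricks: displayHTML(html)
print(html[:80])
```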

Performance:

I don't recommend calling this on a huge dataframe.

Doing this:

display(spark.createDataFrame(df))

took 4.64 minutes to display 74 rows (lots of overhead passing the data to worker nodes then collecting them again, just for display).

Using the display_pd method, the same rows are displayed in 0.03 seconds.

Using displayHTML(df.to_html()), results are displayed in 0.02 seconds.
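
A rough way to reproduce this kind of measurement for the to_html step alone is sketched below, using a made-up 74-row frame. It times only the HTML generation, not the rendering done by displayHTML, and the numbers will differ from those quoted above depending on the data and the cluster.

```python
import time

import pandas as pd

# Made-up frame with the same row count as in the comparison above.
df = pd.DataFrame({"id": range(74), "value": [i * 0.5 for i in range(74)]})

# Time only the conversion to an HTML string.
start = time.perf_counter()
html = df.to_html()
elapsed = time.perf_counter() - start

print(f"to_html on {len(df)} rows took {elapsed:.4f} s")
```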