Displaying Pandas Dataframe

sdaza
New Contributor III

I ran into the issue below when trying to display a pandas data frame. Any ideas on how to display a pandas dataframe?

display(mydataframe)
Exception: Cannot call display(<class 'pandas.core.frame.DataFrame'>)


ricardo_portill
New Contributor III

Hi @sdaza,

The display command can be used to visualize Spark data frames or image objects, but not a pandas data frame. If you'd like to visualize your pandas data, I recommend using matplotlib to prep the data into a figure. The code below shows how this would work; remember to import matplotlib using the 'New Library' functionality.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# create dummy data for visualization
my_data = np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]])

x = my_data[:, 0]
y = my_data[:, 1]

# build a matplotlib figure from the data points
fig, ax = plt.subplots()
ax.plot(x, y)

# display the matplotlib Figure object to render the data
display(fig)

ricardo_portill
New Contributor III

Hi @sdaza,

You can use the display command to display objects such as a matplotlib figure or Spark data frames, but not a pandas data frame. The code in my previous reply shows how to do this using matplotlib. Within Databricks, you can also import your own visualization library and display its output using the library's native commands (bokeh or ggplot, for example). See an example here: https://docs.databricks.com/user-guide/visualizations/bokeh.html
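A minimal bokeh sketch, assuming the bokeh library is attached to the cluster (the data and figure settings below are just placeholders):

from bokeh.plotting import figure
from bokeh.embed import file_html
from bokeh.resources import CDN

# build a simple bokeh figure from placeholder points
p = figure(title="dummy data")
p.line([1, 2, 3, 4, 5], [5, 3, 1, 3, 4])

# render the figure to standalone HTML and hand it to Databricks
displayHTML(file_html(p, CDN, "dummy data"))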


sdaza
New Contributor III

It is so unfortunate that we cannot render pandas dataframes. In R you can display data frames and data tables just by calling display(mydata).

Is there an easy way to transform pandas dataframes to pyspark dataframes?

I ran your example but I cannot plot anything in databricks using python.

I get this error. I am using pandas 0.23.0 and matplotlib 2.2.2.

/databricks/python/local/lib/python2.7/site-packages/ggplot/components/smoothers.py:4: FutureWarning: The pandas.lib module is deprecated and will be removed in a future version. These are private functions and can be accessed from pandas._libs.lib instead
  from pandas.lib import Timestamp

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<command-3750589184628757> in <module>()
     11
     12 ## Display the matplotlib Figure object to render our pandas data frame
---> 13 display(fig)

/tmp/1527806120745-0/PythonShell.py in display(self, input, *args, **kwargs)
    708                            " Call help(display) for more info."
    709   # import ggplot is too slow, so delay it until first call of display()
--> 710   import ggplot
    711   if input is None:
    712     curFigure = mpl.pyplot.gcf()

/databricks/python/local/lib/python2.7/site-packages/ggplot/__init__.py in <module>()
     19 __version__ = '0.6.8'
     20
---> 21 from .qplot import qplot
     22 from .ggplot import ggplot
     23 from .components import aes

ImportError: cannot import name Timestamp

ricardo_portill
New Contributor III

@sdaza,

You can go from a Spark data frame to pandas and visualize with matplotlib, or from pandas to a Spark data frame and use the native plotting, using the methods below (separate blocks). Once the plot is defined, the syntax is very similar to display(<data frame>).

As for the error above, the Timestamp type looks to have been moved out of pandas.lib in 0.23. Are you specifying specific versions of pandas for use within Databricks? This should work with pandas 0.19.2, for example, which I believe is the standard version.
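To double-check which pandas the notebook is actually picking up, you can print the version in a cell:

import pandas as pd
print(pd.__version__)   # e.g. 0.19.2 vs 0.23.0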

Pandas to Spark data frame and using native plot

from pyspark.sql.types import *
import numpy as np
import pandas as pd

mySchema = StructType([
    StructField("lat", LongType(), True),
    StructField("long", LongType(), True)
])

pdDF = pd.DataFrame(np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]]))

display(sqlContext.createDataFrame(pdDF, mySchema))

Spark to Pandas and plot with matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pdDF = sc.parallelize([("bob", 100), ("sally", 200)]).toDF(["firstName", "salary"]).toPandas()

plt.clf()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()

sdaza
New Contributor III

Yes, I installed pandas 0.23 because I was using this feature of qcut:

duplicates : {default 'raise', 'drop'}, optional
    If bin edges are not unique, raise ValueError or drop non-uniques.
    New in version 0.20.0.
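For context, the call that needs that 0.20+ behavior looks roughly like this (the data is just illustrative):

import pandas as pd

# with repeated values some quantile edges collide;
# duplicates='drop' (pandas >= 0.20) drops them instead of raising ValueError
values = pd.Series([1, 1, 1, 1, 2, 3, 4, 5])
bins = pd.qcut(values, q=4, duplicates='drop')
print(bins.value_counts())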

ricardo_portill
New Contributor III

@sdaza, this is a compatibility issue. As a workaround, you can make a manual update to address it via the code below.

# TIMESTAMP IMPORT MUST BE CHANGED WHEN USING PANDAS 0.23+
#from pandas.lib import Timestamp
# CHANGE TO THE FOLLOWING
from pandas._libs.tslibs.timestamps import Timestamp

bscannell1_5588
New Contributor II

This generally seems to work for me:

display(spark.createDataFrame(df))

It would certainly be nice if this could happen automatically under the hood, though.

This also has a lot of overhead: it creates a Spark dataframe, distributing the data just to pull it back for display. I really don't understand why Databricks does not simply allow plotting pandas dataframes locally by calling display().
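If you do go the conversion route, one way to keep the overhead down (just a sketch; the row limit is arbitrary) is to convert only the slice you actually want to look at:

# convert only the first rows to Spark, purely for display purposes
display(spark.createDataFrame(df.head(1000)))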

AndrewSprague
New Contributor II

If you only want to view the dataframe contents as a table, add this in a cell:

mydataframe

or

mydataframe.head()


abhishekKumar
New Contributor II

display(mydataframe.astype(str))

Tim_Green
New Contributor II

A simple way to get a nicely formatted table from a pandas dataframe:

displayHTML(df.to_html())
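to_html has some parameters you can control the output with; for example (the exact values below are just illustrative):

displayHTML(df.to_html(
    max_rows=100,                   # truncate very long frames
    na_rep='',                      # render missing values as blanks
    float_format='{:.2f}'.format,   # two decimal places for floats
    index=False))                   # hide the index column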

If you want something less basic, try out this code that I wrote that adds scrolling and some control over the column widths (including index columns, unlike to_html). YMMV, and this might stop working if pandas changes the output of to_html.

def display_pd(df, height=300, column_widths=None, column_units='px'):
  """
  Display pandas dataframe in databricks
  @param df: the Pandas dataframe to display
  @param height: the height in pixels of the table to display
  @param column_widths: specify individual column widths as a list. If not specified, the columns are sized proportionate to the maximum length of data in each column
    Can be shorter than the total number of columns; the remaining columns will take up the remaining space proportionately
    To specify mixed CSS units, pass a list of string widths with the CSS units included, and set column_units to the empty string
  @param column_units: the CSS units of the widths passed in
  """
  import pandas as pd
 
  if not column_widths:
    # proportional widths
    import numpy as np
    len_v = np.vectorize(len)
    lengths = len_v(df.reset_index().values.astype(str)).max(axis=0)
    total = np.sum(lengths)
    column_widths = np.trunc(lengths*100/total)
    column_units='%'
  
  widths = []
  for i, c in enumerate(column_widths):
    widths.append(f'''.display_pd_table thead th:nth-child({i+1}) {{width: {c}{column_units};}}''')
    
  html = f'''
<style>
.display_pd_container {{height: {height}px; width: 100%; overflow: auto;}}
.display_pd_table {{position: sticky; top: 0; width: 100%;}}
.display_pd_table td {{overflow: hidden;}}
.display_pd_table th {{overflow: hidden; vertical-align: top;}}
{chr(10).join(widths)}
.display_pd_table thead {{position: -webkit-sticky; position: sticky; top: 0px; z-index: 100; background-color: rgb(255, 255, 255);}}
</style> 
<div class="display_pd_container">{df.to_html(classes='display_pd_table')}</div>'''
  
  displayHTML(html)

then simply call:

display_pd(df)

Performance:

I don't recommend calling this on a huge dataframe.

  • Using display(spark.createDataFrame(df)) took 4.64 minutes to display 74 rows (lots of overhead passing the data to worker nodes and then collecting it again, just for display).
  • Using the display_pd method above, the same rows are displayed in 0.03 seconds.
  • Using displayHTML(df.to_html()), results are displayed in 0.02 seconds.
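A rough way to reproduce the comparison yourself (just a sketch; your numbers will differ):

import time

start = time.time()
displayHTML(df.to_html())   # swap in display_pd(df) or display(spark.createDataFrame(df)) to compare
print(f"rendered in {time.time() - start:.2f} s")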