Displaying Pandas Dataframe
05-29-2018 08:13 PM
05-29-2018 09:53 PM
Hi @sdaza,
The display command can be used to visualize Spark data frames or image objects, but not a pandas data frame. If you'd like to visualize your pandas data, I recommend using matplotlib to turn the data into a figure. The code below shows how this would work; remember to add matplotlib using the 'New Library' functionality.
import numpy as np
import pandas as pd
## create dummy pandas data frame for visualization
my_data = np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]])
import matplotlib.pyplot as plt
x = my_data[:,0]
y = my_data[:,1]
## extract subplots from the data points
fig, ax = plt.subplots()
ax.plot(x, y)
## Display the matplotlib Figure object to render our pandas data frame
display(fig)
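The example above plots a NumPy array rather than an actual pandas data frame. A minimal sketch of the same plot starting from a pandas DataFrame might look like the following. The column names "x" and "y" and the variable name pdf are assumptions made for illustration, and the Agg backend is selected only so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Same dummy values as the example above, with hypothetical column names.
pdf = pd.DataFrame(
    [[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]],
    columns=["x", "y"],
)

# Build the figure from the DataFrame columns.
fig, ax = plt.subplots()
ax.plot(pdf["x"], pdf["y"])
ax.set_xlabel("x")
ax.set_ylabel("y")

# In a Databricks notebook, display(fig) would then render the figure;
# display is supplied by the notebook and is deliberately not called here.
```

The only difference from the array version is that the columns are selected by name instead of by position.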
05-31-2018 06:43 AM
Hi @sdaza,
You can use the display command to display objects such as a matplotlib figure or Spark data frames, but not a pandas data frame. Below is code to do this using matplotlib. Within Databricks, you can also import your own visualization library and display images using that library's native commands (bokeh or ggplot, for example). See an example here: https://docs.databricks.com/user-guide/visualizations/bokeh.html
import numpy as np
import pandas as pd
## create dummy pandas data frame for visualization
my_data = np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]])
import matplotlib.pyplot as plt
x = my_data[:,0]
y = my_data[:,1]
## extract subplots from the data points
fig, ax = plt.subplots()
ax.plot(x, y)
## Display the matplotlib Figure object to render our pandas data frame
display(fig)
06-02-2018 07:41 AM
It is so unfortunate that we cannot render pandas dataframes directly. In R, you can display dataframes and data tables using just display(mydata).
Is there an easy way to transform pandas dataframes to pyspark dataframes?
I ran your example, but I cannot plot anything in Databricks using Python.
I get this error. I am using pandas 0.23.0 and matplotlib 2.2.2.
/databricks/python/local/lib/python2.7/site-packages/ggplot/components/smoothers.py:4: FutureWarning: The pandas.lib module is deprecated and will be removed in a future version. These are private functions and can be accessed from pandas._libs.lib instead
  from pandas.lib import Timestamp
ImportError: cannot import name Timestamp
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<command-3750589184628757> in <module>()
     11
     12 ## Display the matplotlib Figure object to render our pandas data frame
---> 13 display(fig)

/tmp/1527806120745-0/PythonShell.py in display(self, input, *args, **kwargs)
    708         " Call help(display) for more info."
    709     # import ggplot is too slow, so delay it until first call of display()
--> 710     import ggplot
    711     if input is None:
    712         curFigure = mpl.pyplot.gcf()

/databricks/python/local/lib/python2.7/site-packages/ggplot/__init__.py in <module>()
     19 __version__ = '0.6.8'
     20
---> 21 from .qplot import qplot
     22 from .ggplot import ggplot
     23 from .components import aes
06-02-2018 12:36 PM
@sdaza,
You can go from a Spark data frame to pandas and visualize with matplotlib, or from pandas to a Spark data frame (separate block), using the methods below. The syntax for the pandas plot is very similar to display(<data frame>) once the plot is defined.
As for the error above, the Timestamp type appears to be importable from pandas rather than pandas.lib as of 0.23. Are you pinning a specific version of pandas for use within Databricks? This should work with pandas 0.19.2, for example, which I believe is the standard version.
Pandas to Spark data frame and using native plot
import numpy as np
import pandas as pd
from pyspark.sql.types import *
mySchema = StructType([
    StructField("lat", LongType(), True),
    StructField("long", LongType(), True)
])
pdDF = pd.DataFrame(np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]]))
display(sqlContext.createDataFrame(pdDF, mySchema))
Spark to Pandas and plot with matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pdDF = sc.parallelize([("bob", 100), ("sally", 200)]).toDF(["firstName", "salary"]).toPandas()
plt.clf()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()
06-02-2018 01:56 PM
Yes, I installed pandas 0.23 because I was using this feature:
qcut
duplicates : {default ‘raise’, ‘drop’}, optional
If bin edges are not unique, raise ValueError or drop non-uniques.
New in version 0.20.0.
06-08-2018 07:13 AM
@sdaza, this is a compatibility issue. As a workaround, you can manually update the import to address it, using the code below.
# TIMESTAMP IMPORT MUST BE CHANGED WHEN USING PANDAS 0.23.0+
#from pandas.lib import Timestamp
# CHANGE TO THE FOLLOWING
from pandas._libs.tslibs.timestamps import Timestamp
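If you are patching the import anyway, a possibly more durable variant is to import Timestamp from the top-level pandas namespace, where it is part of the public API, and keep the private path only as a fallback. This is a hedged sketch rather than a tested fix for the ggplot package itself; the fallback branch simply reuses the path given above.

```python
# Prefer the public location, which does not depend on pandas internals.
try:
    from pandas import Timestamp
except ImportError:
    # Fallback to the private path suggested in the workaround above.
    from pandas._libs.tslibs.timestamps import Timestamp

# Quick check that the imported class behaves as expected.
ts = Timestamp("2018-06-08 07:13")
print(ts.year, ts.month, ts.day)
```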
10-18-2018 07:35 AM
This generally seems to work for me:
display(spark.createDataFrame(df))
It would certainly be nice if this could happen automatically under the hood, though.
12-19-2018 04:41 AM
This also has a lot of overhead: it creates a Spark dataframe, distributing the data just to pull it back for display. I really don't understand why Databricks does not simply allow plotting pandas dataframes locally by calling display().
11-04-2019 09:23 AM
If you only want to view the dataframe contents as a table, add this in a cell:
mydataframe
or
mydataframe.head()
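As a small illustration of the two options above, the sketch below builds a throwaway dataframe (the data is made up for the example) and shows head() with an explicit row count. In a notebook, leaving the dataframe as the last expression of a cell renders it as a table; to_string() is used here only so the same content can be printed as plain text anywhere.

```python
import pandas as pd

# Hypothetical sample data, named after the placeholder used above.
mydataframe = pd.DataFrame(
    {"firstName": ["bob", "sally", "ann"], "salary": [100, 200, 300]}
)

# head(n) returns the first n rows (the default is 5).
preview = mydataframe.head(2)

# Plain-text rendering of the preview, without the index column.
text = preview.to_string(index=False)
print(text)
```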
11-04-2019 10:47 AM
Excellent and nice post. It will be beneficial for everyone. Thanks for sharing such a wonderful post.
03-08-2021 05:05 AM
display(mydataframe.astype(str))
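The post does not say why the cast helps. One plausible reading, which is an assumption here, is that converting every column to strings sidesteps column types that display cannot handle. The sketch below only shows what astype(str) does to the values; the sample columns are hypothetical, and display itself is not called because it exists only inside the notebook.

```python
import datetime as dt

import pandas as pd

# Hypothetical dataframe with mixed column types, including Python lists.
mydataframe = pd.DataFrame(
    {
        "id": [1, 2],
        "when": [dt.date(2021, 3, 8), dt.date(2021, 3, 9)],
        "tags": [["a", "b"], []],
    }
)

# Every value becomes its string representation.
as_text = mydataframe.astype(str)
print(as_text)

# In a Databricks notebook, display(as_text) would be the next step.
```

Note that the cast is lossy: numbers and dates become text, so sorting and aggregation in the rendered table operate on strings.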
06-07-2022 02:13 PM
A simple way to get a nicely formatted table from a pandas dataframe:
displayHTML(df.to_html())
to_html accepts several parameters that control the output. If you want something less basic, try out this code that I wrote, which adds scrolling and some control over the column widths (including index columns, unlike to_html). YMMV, and this might stop working if pandas changes the output of to_html.
def display_pd(df, height=300, column_widths=None, column_units='px'):
    """
    Display pandas dataframe in databricks
    @param df: the Pandas dataframe to display
    @param height: the height in pixels of the table to display
    @param column_widths: specify individual column widths as a list. If not specified, the columns are sized proportionate to the maximum length of data in each column
        Can be shorter than the total number of columns, the remaining columns will take up the remaining space proportionately
        To specify mixed CSS units, pass a list of string widths with the CSS units included, and set column_units to the empty string
    @param column_units: the CSS units of the widths passed in
    """
    import pandas as pd
    if not column_widths:
        # proportional widths
        import numpy as np
        len_v = np.vectorize(len)
        lengths = len_v(df.reset_index().values.astype(str)).max(axis=0)
        total = np.sum(lengths)
        column_widths = np.trunc(lengths*100/total)
        column_units='%'
    widths = []
    for i, c in enumerate(column_widths):
        widths.append(f'''.display_pd_table thead th:nth-child({i+1}) {{width: {c}{column_units};}}''')
    html = f'''
    <style>
    .display_pd_container {{height: {height}px; width: 100%; overflow: auto;}}
    .display_pd_table {{position: sticky; top: 0; width: 100%;}}
    .display_pd_table td {{overflow: hidden;}}
    .display_pd_table th {{overflow: hidden; vertical-align: top;}}
    {chr(10).join(widths)}
    .display_pd_table thead {{position: -webkit-sticky; position: sticky; top: 0px; z-index: 100; background-color: rgb(255, 255, 255);}}
    </style>
    <div class="display_pd_container">{df.to_html(classes='display_pd_table')}</div>'''
    displayHTML(html)
Then simply call:
display_pd(df)
Performance:
I don't recommend calling this on a huge dataframe.
- Doing this:
display(spark.createDataFrame(df))
took 4.64 minutes to display 74 rows (lots of overhead passing the data to worker nodes then collecting them again, just for display).
- Using the display_pd method, the same rows are displayed in 0.03 seconds.
- Using displayHTML(df.to_html()), results are displayed in 0.02 seconds.
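Given the warning above about huge dataframes, one option is to cap the number of rows before converting to HTML. The helper below is a hypothetical sketch (the name to_html_preview and the default of 100 rows are assumptions, not part of the code above); it uses only DataFrame.head and DataFrame.to_html, and leaves the displayHTML call to the notebook.

```python
import pandas as pd


def to_html_preview(df, max_rows=100):
    """Return an HTML table for at most max_rows rows of df (hypothetical helper)."""
    # head() keeps only the first max_rows rows, so the HTML stays small
    # no matter how large the original dataframe is.
    return df.head(max_rows).to_html()


# Small demonstration with made-up data.
big = pd.DataFrame({"value": range(1000)})
html = to_html_preview(big, max_rows=5)
print(len(html))

# In a Databricks notebook: displayHTML(html)
```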