Tim_Green
New Contributor II

A simple way to get a nicely formatted table from a pandas dataframe:

displayHTML(df.to_html())

to_html accepts several parameters that control its output. If you want something less basic, try the function below, which adds scrolling and some control over the column widths (including index columns, unlike to_html). YMMV: it relies on the HTML that to_html generates, so it might stop working if pandas changes that output.
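As a quick illustration of the to_html parameters mentioned above, here is a minimal sketch. The sample dataframe and the chosen values are made up for the example; index, na_rep, float_format and max_rows are existing to_html arguments. displayHTML is only available inside a Databricks notebook, so it is left as a comment.

```python
import pandas as pd

# Made-up sample data, just for illustration
df = pd.DataFrame({"name": ["alpha", "beta", "gamma"],
                   "score": [1.23456, 2.5, float("nan")]})

html = df.to_html(
    index=False,                   # leave out the index column
    na_rep="-",                    # text shown for missing values
    float_format="{:.2f}".format,  # two decimals for floats
    max_rows=100,                  # cap the number of rendered rows
)

# In a Databricks notebook you would then call:
# displayHTML(html)
```

This is often enough when you only need lighter formatting and no scrolling.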

def display_pd(df, height=300, column_widths=None, column_units='px'):
  """
  Display pandas dataframe in databricks
  @param df: the Pandas dataframe to display
  @param height: the height in pixels of the table to display
  @param column_widths: specify individual column widths as a list. If not specified, the columns are sized proportionate to the maximum length of data in each column
    Can be shorter than the total number of columns; the remaining columns will take up the remaining space proportionately
    To specify mixed CSS units, pass a list of string widths with the CSS units included, and set column_units to the empty string
  @param column_units: the CSS units of the widths passed in
  """
  import numpy as np

  if column_widths is None or len(column_widths) == 0:
    # proportional widths: each column's share follows its longest cell, measured as text
    len_v = np.vectorize(len)
    lengths = len_v(df.reset_index().values.astype(str)).max(axis=0)
    total = np.sum(lengths)
    column_widths = np.trunc(lengths*100/total)
    column_units='%'
  
  widths = []
  for i, c in enumerate(column_widths):
    widths.append(f'''.display_pd_table thead th:nth-child({i+1}) {{width: {c}{column_units};}}''')
    
  html = f'''
<style>
.display_pd_container {{height: {height}px; width: 100%; overflow: auto;}}
.display_pd_table {{position: sticky; top: 0; width: 100%;}}
.display_pd_table td {{overflow: hidden;}}
.display_pd_table th {{overflow: hidden; vertical-align: top;}}
{chr(10).join(widths)}
.display_pd_table thead {{position: -webkit-sticky; position: sticky; top: 0px; z-index: 100; background-color: rgb(255, 255, 255);}}
</style> 
<div class="display_pd_container">{df.to_html(classes='display_pd_table')}</div>'''
  
  displayHTML(html)

Then simply call:

display_pd(df)
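To make the default sizing easier to follow, here is a small standalone sketch of the same proportional-width calculation that display_pd performs when column_widths is not given. The helper name proportional_widths and the sample data are hypothetical; the arithmetic mirrors the function above.

```python
import numpy as np
import pandas as pd

def proportional_widths(df):
    """Percentage width per column (index columns first), based on the
    longest cell in each column. Hypothetical helper for illustration."""
    len_v = np.vectorize(len)
    # reset_index() turns the index into regular columns so it gets a width too
    lengths = len_v(df.reset_index().values.astype(str)).max(axis=0)
    return np.trunc(lengths * 100 / np.sum(lengths))

# Made-up sample: longest cells are 1 (index), 4 (name) and 5 (note) characters
sample = pd.DataFrame({"name": ["a", "bbbb"], "note": ["xxxxx", "yy"]})
widths = proportional_widths(sample)
print(list(widths))
```

Note that only the cell contents are measured, not the column headers, so a column with a long name and short values can end up narrow. In that case, passing explicit widths, for example display_pd(df, column_widths=[40, 200]), may work better.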

Performance:

I don't recommend calling this on a huge dataframe.

  • Doing this:
display(spark.createDataFrame(df))

took 4.64 minutes to display 74 rows (there is a lot of overhead in passing the data to the worker nodes and then collecting it again, just for display).

  • Using the display_pd method, the same rows are displayed in 0.03 seconds.

  • Using displayHTML(df.to_html()), results are displayed in 0.02 seconds.
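If you want to check timings like these on your own data, a rough sketch along the following lines could help. The helper time_to_html and the sample dataframe are made up for illustration; it measures only the HTML generation step, not the rendering in the notebook, and the numbers will vary by cluster and data.

```python
import time
import pandas as pd

def time_to_html(df, max_rows=None):
    """Return (elapsed_seconds, html) for df.to_html().
    Hypothetical helper; max_rows can cap the output for large dataframes."""
    start = time.perf_counter()
    html = df.to_html(max_rows=max_rows)
    return time.perf_counter() - start, html

# Made-up dataframe with 74 rows, matching the row count mentioned above
df = pd.DataFrame({"id": range(74), "value": [i * 0.5 for i in range(74)]})
elapsed, html = time_to_html(df)
print(f"to_html on {len(df)} rows took {elapsed:.4f} s")
```

For a large dataframe, passing max_rows (or calling df.head(n) first) keeps the generated HTML small.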