Re: Displaying Pandas Dataframe

ricardo_portill · 06-02-2018

@sdaza,

You can convert a Spark DataFrame to pandas and visualize it with matplotlib, or convert a pandas DataFrame to a Spark DataFrame (shown in a separate block), using the methods below. Once the plot is defined, showing a pandas plot is very similar to calling display(<data frame>).

As for the error above, the Timestamp type appears to have moved from pandas.lib to the top-level pandas namespace in 0.23. Are you pinning a specific version of pandas for use within Databricks? This should work with pandas 0.19.2, for example, which I believe is the standard version.
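To check which case applies, a minimal sketch along these lines may help. It only reads the installed pandas version and uses the top-level pd.Timestamp name; the 0.23 cutoff is taken from the note above rather than verified, and the variable names are just for illustration.

```python
import pandas as pd

# Show the installed version so it can be compared with the one the cluster expects.
print(pd.__version__)

# Parse "major.minor" from the version string (e.g. "0.23.0" -> (0, 23)).
major, minor = (int(part) for part in pd.__version__.split(".")[:2])

# Assumption from the post: from 0.23 on, Timestamp is no longer reachable via pandas.lib.
uses_new_layout = (major, minor) >= (0, 23)
print("pandas.lib.Timestamp likely unavailable:", uses_new_layout)

# The top-level name is the public one, so code written against it avoids the pandas.lib lookup.
ts = pd.Timestamp("2018-06-02")
```

If the flag prints True, anything that still reaches into pandas.lib for Timestamp would be the first place to look.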

Pandas to Spark data frame and using native plot

import numpy as np
import pandas as pd
from pyspark.sql.types import StructType, StructField, LongType

mySchema = StructType([
    StructField("lat", LongType(), True),
    StructField("long", LongType(), True),
])

pdDF = pd.DataFrame(np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]]))

display(sqlContext.createDataFrame(pdDF, mySchema))

Spark to Pandas and plot with matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pdDF = sc.parallelize([("bob", 100), ("sally", 200)]).toDF(["firstName", "salary"]).toPandas()

plt.clf()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()
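If you want to try the plotting step without a cluster, here is a rough sketch that builds the same two-row pandas DataFrame directly instead of going through sc.parallelize(...).toPandas(). The Agg backend and the ax/fig names are my assumptions so that it runs headless; they are not part of the snippet above.

```python
import matplotlib
matplotlib.use("Agg")  # assumed headless backend so this runs without a display

import pandas as pd

# Same data as the Spark example, built directly in pandas.
pdDF = pd.DataFrame({"firstName": ["bob", "sally"], "salary": [100, 200]})

# DataFrame.plot returns the matplotlib Axes it drew on.
ax = pdDF.plot(x="firstName", y="salary", kind="bar", rot=45)

# Keep a handle to the figure so it can be shown or saved explicitly.
fig = ax.get_figure()
```

In a notebook I would expect passing fig to display to work in place of the bare display() call, but I have not checked that against every runtime version.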