06-02-2018 12:36 PM
@sdaza,
You can convert a Spark DataFrame to pandas and visualize it with matplotlib, or convert a pandas DataFrame to a Spark DataFrame and use the native plot (separate block), using the methods below. Once the plot is defined, the syntax for showing the pandas plot is very similar to display(<data frame>).
As for the error above, it looks like the Timestamp type was moved to pandas instead of pandas.lib in 0.23. Are you pinning a specific version of pandas for use within Databricks? This should work with pandas 0.19.2, for example, which I believe is the standard version.
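If it helps to narrow this down, here is a minimal sketch (not taken from your notebook, so treat the timestamp value as a made-up example) that prints the installed pandas version and refers to Timestamp through the top-level pandas namespace, which does not depend on pandas.lib:

```python
import pandas as pd

# Print the installed pandas version to see which release the cluster is using.
print(pd.__version__)

# Use the top-level name pd.Timestamp rather than reaching into pandas.lib.
# The date below is only an illustrative value.
ts = pd.Timestamp("2018-06-02 12:36:00")
print(ts.year, ts.month, ts.day)
```

If the version printed is 0.23 or later, any code that still references pandas.lib for Timestamp is a likely source of the error.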
Pandas to Spark data frame and using native plot
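import numpy as np
import pandas as pd
from pyspark.sql.types import StructType, StructField, LongType

# Schema applied by position to the two integer columns of the pandas DataFrame
mySchema = StructType([
    StructField("lat", LongType(), True),
    StructField("long", LongType(), True),
])
pdDF = pd.DataFrame(np.array([[5, 1], [3, 3], [1, 2], [3, 1], [4, 2], [7, 1], [7, 1]]))
# display() and sqlContext are provided by the Databricks notebook
display(sqlContext.createDataFrame(pdDF, mySchema))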
Spark to Pandas and plot with matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pdDF = sc.parallelize([("bob", 100), ("sally", 200)]).toDF(["firstName", "salary"]).toPandas()
plt.clf()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()