Access struct elements inside dataframe?
10-23-2015 06:07 AM
I have a JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLlib LabeledPoint, and have managed to split the price string into an array of strings. The code below creates a data set with the correct structure:
--------------
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1:Double, f2:Double, price:Array[String])
val obs1 = new Obs(1, 2, Array("USD", "5.00"))
val obs2 = new Obs(2, 1, Array("USD", "3.00"))
val df = sc.parallelize(Seq(obs1, obs2)).toDF()
df.printSchema
df.show()
val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
--------------------
When I run this, I get this (and a bit more):
df: org.apache.spark.sql.DataFrame=[f1: double, f2: double, price: array<string>]
So "price" is an array of strings.
I also get a class cast exception:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
which is probably due to the 'println', but it probably also means that I'm not getting the 2nd element of the 'price' structure.
Help?
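
For reference, the exception message suggests that the value stored in the "price" column is a Seq wrapper (WrappedArray) rather than a raw Array[String], so the cast itself may be what fails, not the println. Below is a minimal plain-Scala sketch of that idea, with no Spark involved. The names PriceCastDemo, stored, viaArrayCast and viaSeqCast are made up for illustration, and `stored` is only an assumed stand-in for what row.get(2) returns.

```scala
import scala.util.Try

object PriceCastDemo {
  // Assumed stand-in for row.get(2): a Seq wrapper around the strings, typed as Any.
  val stored: Any = Array("USD", "5.00").toSeq

  // Same cast as in the post: a Seq wrapper is not a String[], so this fails.
  def viaArrayCast(value: Any): Try[Double] =
    Try(value.asInstanceOf[Array[String]].apply(1).toDouble)

  // Casting to Seq[String] matches the runtime type, so element 1 can be read.
  def viaSeqCast(value: Any): Try[Double] =
    Try(value.asInstanceOf[Seq[String]].apply(1).toDouble)

  def main(args: Array[String]): Unit = {
    println(viaArrayCast(stored).isFailure) // the Array cast throws ClassCastException
    println(viaSeqCast(stored))             // the Seq cast yields the numeric part
  }
}
```

If the same reasoning holds inside the DataFrame, then reading the column with row.getSeq[String](2) or row.getAs[Seq[String]](2) and taking element 1 should behave like viaSeqCast. I have not checked that against this exact Spark version.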