Access struct elements inside dataframe?

schnee1
New Contributor III

I have a JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLlib LabeledPoint, and I have managed to split the price string into an array of strings. The code below creates a data set with the correct structure:

--------------

import org.apache.spark.mllib.linalg.{Vector,Vectors}

import org.apache.spark.mllib.regression.LabeledPoint

case class Obs(f1: Double, f2: Double, price: Array[String])

val obs1 = new Obs(1, 2, Array("USD", "5.00"))

val obs2 = new Obs(2, 1, Array("USD", "3.00"))

val df = sc.parallelize(Seq(obs1,obs2)).toDF()

df.printSchema()

df.show()

val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))

labeled.take(2).foreach(println)

--------------------

When I run this, I get this (and a bit more):

df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array&lt;string&gt;]

"price" is an array of string.

I also get a class cast exception

java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;

which is probably due to the 'println', but also probably means that I'm not getting the second element of the 'price' structure.
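For reference, a minimal pure-Scala sketch of what I think is happening (assuming Spark materializes an array&lt;string&gt; column as a Scala Seq/WrappedArray rather than a JVM Array[String], which is what the exception message suggests):

```scala
// Stand-in for what row.get(2) appears to return: a Seq, not an Array.
val price: Any = Seq("USD", "5.00")

// price.asInstanceOf[Array[String]]  // would throw ClassCastException,
//                                    // since a Seq is not a JVM array

// Casting to Seq[String] instead succeeds, and element access works:
val amount = price.asInstanceOf[Seq[String]](1).toDouble // 5.0

// In the DataFrame code the equivalent would presumably be:
//   row.getAs[Seq[String]](2)(1).toDouble
```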

Help?