Access struct elements inside dataframe?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-23-2015 06:07 AM
I have JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLLIB LabeledPoint, and have managed to split the price string into an array of string. The below creates a data set with the correct structure:
--------------
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1:Double, f2:Double, price:Array[String])
val obs1 =newObs(1,2,Array("USD","5.00"))
val obs2 =newObs(2,1,Array("USD","3.00"))
val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema df.show()
val labeled = df.map(row =>LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble,Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
--------------------
When I run this, I get this (and a bit more):
df: org.apache.spark.sql.DataFrame=[f1: double, f2: double, price: array<string>]
"price" is an array of string.
I also get a class cast exception
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
which is probably do to the 'println', but also probably means that I'm not getting the 2nd element of the 'price' structure.
Help?
- Labels:
-
Dataframes
-
Mllib
-
Scala
-
Spark
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-23-2015 09:45 AM
Hi Schnee -
In this case I would use the explode operator in Dataframes.
With explode you can take an array and apply an operation on all elements in the array.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-23-2015 10:16 AM
@schnee
Here's an example: https://forums.databricks.com/questions/893/how-do-i-explode-a-dataframe-column-containing-a-c.html
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-23-2015 11:34 AM
thanks @cfregly. I think I must need a lot more remedial learning on Scala.
The reference you provided works great, but when I try to translate it into my problem via:
val dfExploded = df.explode(df("price")) { case Row(pr: Array[Row]) => pr.map(pr => Price(pr(0).asInstanceOf[String], pr(1).asInstanceOf[String]) ) }
dfExploded.show()
I wind up with exceptions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-23-2015 05:54 PM
@schnee
ha! nah, this is an unnecessarily verbose and complex way of doing a fairly common transformation.
it would be nice to have a df.explodeArray() method.
anyway, what type of exceptions are you seeing?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-24-2015 07:20 AM
I wound up getting past it with something like:
val assembler = new VectorAssembler() .setInputCols(Array("f1", "f2")) .setOutputCol("features")
val labeled = assembler.transform(df) .select($"price".getItem(1).cast("double"), $"features") .map{case Row(price: Double, features: Vector) => LabeledPoint(price, features)}
which is seems much less verbose (h/t stackoverflow) and directly "promotes" the struct's elements to where I need them.
I also wound up getting past the exceptions (the were, IIRC, match exceptions).
Thanks for the lean-in.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-24-2015 08:50 AM
For a bit more detail, this sort of worked:
val dfExploded = df.explode(df("price")) { case Row(pr: WrappedArray[String]) => pr.map(pr => Price(pr(0).toString, pr(1).toString) ) }
dfExploded.show()
(I had to use "WrappedArray" instead of "Array" to get past the exceptions)
but the output had some problems (char limits in this forum are forcing be to be terse)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-24-2016 11:07 PM
@schnee
It is clear from the exception that row.get(2) is of WrappedArray object. It is because Array is DataType of ArrayType. All ArrayType objects are stored as WrappedArray[Any]. So, to retrieve price, do row.get(2).asInstanceOf[Array[String]].
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-10-2023 08:26 PM
Thanks, Golden Triangle Tour
![](/skins/images/97567C72181EBE789E1F0FD869E4C89B/responsive_peak/images/icon_anonymous_message.png)
![](/skins/images/97567C72181EBE789E1F0FD869E4C89B/responsive_peak/images/icon_anonymous_message.png)