<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Access struct elements inside dataframe? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30035#M21716</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi Schnee -&lt;/P&gt;
&lt;P&gt;In this case I would use the explode operator on DataFrames.&lt;/P&gt;
&lt;P&gt;With explode you can take an array column and produce one row per element, so an operation can then be applied to each element of the array.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 23 Oct 2015 16:45:17 GMT</pubDate>
    <dc:creator>rlgarris</dc:creator>
    <dc:date>2015-10-23T16:45:17Z</dc:date>
    <item>
      <title>Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30034#M21715</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I have a JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLlib LabeledPoint, and have managed to split the price string into an array of strings. The code below creates a data set with the correct structure:&lt;/P&gt;
&lt;P&gt;--------------&lt;/P&gt;
&lt;P&gt;import org.apache.spark.mllib.linalg.{Vector,Vectors}&lt;/P&gt;
&lt;P&gt;import org.apache.spark.mllib.regression.LabeledPoint&lt;/P&gt;
&lt;P&gt;case class Obs(f1:Double, f2:Double, price:Array[String])&lt;/P&gt;
&lt;P&gt;val obs1 = new Obs(1,2,Array("USD","5.00"))&lt;/P&gt;
&lt;P&gt;val obs2 = new Obs(2,1,Array("USD","3.00"))&lt;/P&gt;
&lt;P&gt;val df = sc.parallelize(Seq(obs1,obs2)).toDF()&lt;/P&gt;
&lt;P&gt;df.printSchema&lt;/P&gt;
&lt;P&gt;df.show()&lt;/P&gt;
&lt;P&gt;val labeled = df.map(row =&amp;gt; LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))&lt;/P&gt;
&lt;P&gt; labeled.take(2).foreach(println) &lt;/P&gt;
&lt;P&gt;--------------------&lt;/P&gt;
&lt;P&gt;When I run this, I get this (and a bit more):&lt;/P&gt;
&lt;P&gt;df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array&amp;lt;string&amp;gt;]&lt;/P&gt;
&lt;P&gt;"price" is an array of string.&lt;/P&gt;
&lt;P&gt;I also get a class cast exception&lt;/P&gt;
&lt;P&gt;java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;&lt;/P&gt;
&lt;P&gt;which is probably due to the 'println', but also probably means that I'm not getting the 2nd element of the 'price' structure.&lt;/P&gt;
&lt;P&gt;Help?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Oct 2015 13:07:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30034#M21715</guid>
      <dc:creator>schnee1</dc:creator>
      <dc:date>2015-10-23T13:07:48Z</dc:date>
    </item>
    <item>
      <title>Re: Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30035#M21716</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi Schnee -&lt;/P&gt;
&lt;P&gt;In this case I would use the explode operator on DataFrames.&lt;/P&gt;
&lt;P&gt;With explode you can take an array column and produce one row per element, so an operation can then be applied to each element of the array.&lt;/P&gt;
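&lt;P&gt;As a rough illustration of what explode does to an array column, here is a minimal sketch. It assumes Spark 2.x or later (it uses SparkSession and the explode function from org.apache.spark.sql.functions); the object name ExplodeSketch and the column alias priceElem are made up for the example, and the data just mirrors the shape in the question.&lt;/P&gt;

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical wrapper object, only here so the sketch is self-contained.
object ExplodeSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in a notebook a session usually already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same shape as the question's data: two doubles plus an array of strings.
    val df = Seq(
      (1.0, 2.0, Seq("USD", "5.00")),
      (2.0, 1.0, Seq("USD", "3.00"))
    ).toDF("f1", "f2", "price")

    // explode emits one output row per array element and repeats f1 and f2
    // on each of those rows. "priceElem" is an assumed alias.
    val exploded = df.select($"f1", $"f2", explode($"price").as("priceElem"))
    exploded.show()

    spark.stop()
  }
}
```

&lt;P&gt;Note that this turns each two-element price array into two rows, which may or may not be what you want if you only need the numeric part.&lt;/P&gt;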
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Oct 2015 16:45:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30035#M21716</guid>
      <dc:creator>rlgarris</dc:creator>
      <dc:date>2015-10-23T16:45:17Z</dc:date>
    </item>
    <item>
      <title>Re: Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30036#M21717</link>
      <description>&lt;P&gt;@schnee​&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here's an example: &lt;A href="https://forums.databricks.com/questions/893/how-do-i-explode-a-dataframe-column-containing-a-c.html" alt="https://forums.databricks.com/questions/893/how-do-i-explode-a-dataframe-column-containing-a-c.html" target="_blank"&gt;https://forums.databricks.com/questions/893/how-do-i-explode-a-dataframe-column-containing-a-c.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Oct 2015 17:16:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30036#M21717</guid>
      <dc:creator>cfregly</dc:creator>
      <dc:date>2015-10-23T17:16:08Z</dc:date>
    </item>
    <item>
      <title>Re: Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30037#M21718</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;thanks @cfregly. I think I must need a lot more remedial learning on Scala.&lt;/P&gt;
&lt;P&gt;The reference you provided works great, but when I try to translate it into my problem via:&lt;/P&gt;
&lt;P&gt;val dfExploded = df.explode(df("price")) { case Row(pr: Array[Row]) =&amp;gt; pr.map(pr =&amp;gt; Price(pr(0).asInstanceOf[String], pr(1).asInstanceOf[String]) ) } &lt;/P&gt;
&lt;P&gt;dfExploded.show() &lt;/P&gt;
&lt;P&gt;I wind up with exceptions.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Oct 2015 18:34:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30037#M21718</guid>
      <dc:creator>schnee1</dc:creator>
      <dc:date>2015-10-23T18:34:27Z</dc:date>
    </item>
    <item>
      <title>Re: Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30038#M21719</link>
      <description>&lt;P&gt;@schnee​&amp;nbsp;&lt;/P&gt;&lt;P&gt;ha! nah, this is an unnecessarily verbose and complex way of doing a fairly common transformation. &lt;/P&gt;&lt;P&gt;it would be nice to have a df.explodeArray() method.&lt;/P&gt;&lt;P&gt;anyway, what type of exceptions are you seeing?&lt;/P&gt;</description>
      <pubDate>Sat, 24 Oct 2015 00:54:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30038#M21719</guid>
      <dc:creator>cfregly</dc:creator>
      <dc:date>2015-10-24T00:54:24Z</dc:date>
    </item>
    <item>
      <title>Re: Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30039#M21720</link>
      <description>&lt;P&gt;I wound up getting past it with something like:&lt;/P&gt; 
&lt;P&gt;val assembler = new VectorAssembler() .setInputCols(Array("f1", "f2")) .setOutputCol("features") &lt;/P&gt;
&lt;P&gt;val labeled = assembler.transform(df) .select($"price".getItem(1).cast("double"), $"features") .map{case Row(price: Double, features: Vector) =&amp;gt; LabeledPoint(price, features)}&lt;/P&gt;
&lt;P&gt;which seems much less verbose (h/t stackoverflow) and directly "promotes" the struct's elements to where I need them.&lt;/P&gt;
&lt;P&gt;I also wound up getting past the exceptions (they were, IIRC, match exceptions).&lt;/P&gt;
&lt;P&gt;Thanks for the lean-in.&lt;/P&gt;
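&lt;P&gt;For anyone who only needs the column part of the above, here is a minimal sketch of getItem plus cast in isolation. It assumes Spark 2.x or later; the object name GetItemSketch and the alias "label" are invented for the example.&lt;/P&gt;

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, only here so the sketch is self-contained.
object GetItemSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in a notebook a session usually already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("getitem-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same shape as the question's data: two doubles plus an array of strings.
    val df = Seq(
      (1.0, 2.0, Seq("USD", "5.00")),
      (2.0, 1.0, Seq("USD", "3.00"))
    ).toDF("f1", "f2", "price")

    // getItem(1) selects the second array element (the numeric string);
    // cast("double") converts it to a double column. "label" is an assumed alias.
    val withLabel = df.select(
      $"price".getItem(1).cast("double").as("label"),
      $"f1",
      $"f2"
    )
    withLabel.printSchema()
    withLabel.show()

    spark.stop()
  }
}
```

&lt;P&gt;Doing the extraction as a column expression avoids pattern matching on the array's runtime type altogether.&lt;/P&gt;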
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Oct 2015 14:20:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30039#M21720</guid>
      <dc:creator>schnee1</dc:creator>
      <dc:date>2015-10-24T14:20:21Z</dc:date>
    </item>
    <item>
      <title>Re: Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30040#M21721</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;For a bit more detail, this sort of worked:&lt;/P&gt;
&lt;P&gt;val dfExploded = df.explode(df("price")) { case Row(pr: WrappedArray[String]) =&amp;gt; pr.map(pr =&amp;gt; Price(pr(0).toString, pr(1).toString) ) } &lt;/P&gt;
&lt;P&gt;dfExploded.show()&lt;/P&gt;
&lt;P&gt;(I had to use "WrappedArray" instead of "Array" to get past the exceptions)&lt;/P&gt;
&lt;P&gt;but the output had some problems (char limits in this forum are forcing me to be terse)&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Oct 2015 15:50:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30040#M21721</guid>
      <dc:creator>schnee1</dc:creator>
      <dc:date>2015-10-24T15:50:58Z</dc:date>
    </item>
    <item>
      <title>Re: Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30041#M21722</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;@schnee&lt;/P&gt;
&lt;P&gt;It is clear from the exception that row.get(2) returns a WrappedArray. That is because the column's DataType is ArrayType, and ArrayType values are stored as WrappedArray[Any] (a Seq), not as a Java array. So, to retrieve price, read it as a Seq, for example row.getSeq[String](2) or row.get(2).asInstanceOf[Seq[String]], rather than casting to Array[String].&lt;/P&gt;
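&lt;P&gt;A minimal sketch of reading an array column from a Row as a Seq, assuming Spark 2.x or later; the object name ArrayColumnSketch is made up for the example and the data mirrors the question.&lt;/P&gt;

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, only here so the sketch is self-contained.
object ArrayColumnSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in a notebook a session usually already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-column-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same shape as the question's data: two doubles plus an array of strings.
    val df = Seq(
      (1.0, 2.0, Seq("USD", "5.00")),
      (2.0, 1.0, Seq("USD", "3.00"))
    ).toDF("f1", "f2", "price")

    // Array columns come back from a Row as a Seq, so use getSeq
    // instead of casting the value to Array[String].
    val prices = df.collect().map { row =>
      val priceParts = row.getSeq[String](2) // column index 2 is "price"
      priceParts(1).toDouble                 // second element is the numeric part
    }
    prices.foreach(println)

    spark.stop()
  }
}
```

&lt;P&gt;The same access pattern should work inside a map over the rows, which is where the original cast was failing.&lt;/P&gt;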
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Jun 2016 06:07:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/30041#M21722</guid>
      <dc:creator>yuga</dc:creator>
      <dc:date>2016-06-25T06:07:41Z</dc:date>
    </item>
    <item>
      <title>Re: Access struct elements inside dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/39560#M27014</link>
      <description>&lt;P&gt;Thanks,&amp;nbsp;&lt;A title="Golden Triangle Tour" href="https://www.toursgoldentriangle.com/" target="_blank" rel="noopener"&gt;Golden Triangle Tour&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Aug 2023 03:26:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/access-struct-elements-inside-dataframe/m-p/39560#M27014</guid>
      <dc:creator>goldentriangle</dc:creator>
      <dc:date>2023-08-11T03:26:34Z</dc:date>
    </item>
  </channel>
</rss>

