How to append new column values in dataframe behalf of unique id's

supriya — Fri, 22 Jan 2016 09:47:07 GMT

I need to add a new column, with data, to an existing DataFrame.

Example:

val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"), (5L, "l m n"),
  (6L, "mapreduce spark"), (7L, "apache hadoop"),
  (11L, "a b c d e spark"), (12L, "b d"),
  (13L, "spark f g h"), (14L, "hadoop mapreduce")
)).toDF("id", "text")

import org.apache.spark.rdd.RDD
val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))
val rdd: RDD[(Long, Double)] = sparkContext.parallelize(tuples)

Each tuple is an (ID, average) pair. I want to add a new column named Average to test, filling in the value for each row based on its ID, and generate a new DataFrame or RDD.

Re: How to append new column values in dataframe behalf of unique id's

raela — Fri, 29 Jan 2016 17:59:56 GMT

Are you trying to add a new column to tuples?

You would first have to convert tuples into a DataFrame, and this can be easily done:

import sqlContext.implicits._  // needed for toDF on a local collection
val tuplesDF = tuples.toDF("id", "average")

Then you can use withColumn to create a new column:

tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)

Refer to the DataFrame documentation here:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

supriya — Sun, 31 Jan 2016 13:11:40 GMT

Thanks @Raela Wang. But my requirement is different: I want to add an Average column to the test DataFrame, matched on the id column. I know this is possible using a join, but I think a join would be too slow. If you have any other solution, please suggest it.

raela — Wed, 03 Feb 2016 01:08:32 GMT

@supriya

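You will have to do a join. Convert tuples to a DataFrame first, naming its key column tupleid so it does not clash with id in test:

import org.apache.spark.sql.functions._
import sqlContext.implicits._
val tuplesDF = tuples.toDF("tupleid", "average")
val joined = test.join(tuplesDF, col("id") === col("tupleid"), "inner").select("id", "text", "average")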
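If join speed is the worry and the lookup side is small, a broadcast join may help. What follows is only a minimal sketch, not a tested recipe for your cluster. It assumes Spark 2.x with a local SparkSession (on 1.x, sqlContext plays the same role), and the names spark and averages are made up for illustration. It uses a left outer join so that ids with no average (5 and 14 here) are kept with a null instead of being dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Assumption: a local session for illustration only
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-join-sketch")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the question
val test = Seq(
  (4L, "spark i j k"), (5L, "l m n"),
  (6L, "mapreduce spark"), (7L, "apache hadoop"),
  (11L, "a b c d e spark"), (12L, "b d"),
  (13L, "spark f g h"), (14L, "hadoop mapreduce")
).toDF("id", "text")

// Hypothetical name for the small (id, average) lookup table
val averages = Seq(
  (0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7),
  (11L, 0.15), (12L, 6.1), (13L, 1.8)
).toDF("id", "average")

// broadcast() hints that the small side should be shipped to every executor;
// joining on Seq("id") keeps a single id column in the result
val joined = test.join(broadcast(averages), Seq("id"), "left_outer")

joined.orderBy("id").show(false)
```

Whether the hint actually speeds things up depends on the size of the lookup table and on the cluster, so it is worth comparing against the plain join on real data.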

jackAKAkarthik — Mon, 09 Jan 2017 11:10:14 GMT

You have given a method for deriving a new column from the values of an existing column, but @supriya asked a different question.

jackAKAkarthik — Mon, 09 Jan 2017 11:12:45 GMT

@Raela Wang How can I add a timestamp to every row in the DataFrame dynamically?

val date = new java.util.Date

val AppendDF = existingDF.withColumn("new_column_name",Column date)

This is not working for me.

Can you help?

supriya — Mon, 09 Jan 2017 12:35:34 GMT

@jack AKA karthik: To add a timestamp to a DataFrame dynamically:

import org.apache.spark.sql.functions._
val AppendDF = existingDF.withColumn("new_column_name", current_timestamp())

I think this will work for you.
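For context, withColumn expects a Column as its second argument, so a plain java.util.Date has to be turned into one first. The sketch below shows two options side by side, assuming Spark 2.x with a local SparkSession and a made-up stand-in for existingDF. current_timestamp() is evaluated by Spark at query time, while lit(...) wraps a fixed value captured on the driver. As far as I know, lit does not accept java.util.Date directly, so the sketch converts it to java.sql.Timestamp.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, lit}

// Assumption: a local session for illustration only
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("timestamp-column-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the existingDF from the question
val existingDF = Seq((1L, "a"), (2L, "b")).toDF("id", "text")

// Convert the java.util.Date to java.sql.Timestamp before wrapping it in lit()
val date = new java.util.Date
val fixedTs = new Timestamp(date.getTime)

val appendDF = existingDF
  .withColumn("query_time", current_timestamp()) // evaluated by Spark at query time
  .withColumn("driver_time", lit(fixedTs))       // constant captured on the driver

appendDF.printSchema()
```

Both new columns come out as timestamp type; the column names query_time and driver_time are placeholders.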

jackAKAkarthik — Mon, 09 Jan 2017 14:22:48 GMT

@supriya

Thanks for the help. It worked.

jackAKAkarthik — Thu, 12 Jan 2017 09:40:21 GMT

@supriya

How can I cast current_timestamp() to a string type? My Hive version is old (0.13), and I am not able to load the timestamp into the table as is.

jackAKAkarthik — Thu, 12 Jan 2017 09:49:23 GMT

@Raela Wang

How can I convert current_timestamp() to a string in Scala? I have tried a few approaches, but no luck.

raela — Thu, 12 Jan 2017 15:56:09 GMT

@jack karthik What have you tried? Have you tried cast()?

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

df.select(df("colA").cast("string"))
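Applied to the timestamp case, that could look roughly like the sketch below. It assumes Spark 2.x with a local SparkSession and a made-up two-row df; the column names ts, ts_cast and ts_formatted are placeholders. cast("string") uses Spark's default rendering, while date_format lets you choose the layout yourself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_timestamp, date_format}

// Assumption: a local session for illustration only
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("timestamp-to-string-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input data
val df = Seq((1L, "a"), (2L, "b")).toDF("id", "text")

val withStrings = df
  .withColumn("ts", current_timestamp())
  // Option 1: plain cast, Spark picks the string layout
  .withColumn("ts_cast", col("ts").cast("string"))
  // Option 2: explicit pattern, useful if the target table expects a fixed layout
  .withColumn("ts_formatted", date_format(col("ts"), "yyyy-MM-dd HH:mm:ss"))

withStrings.printSchema()
```

Either string column can then be written to the Hive table in place of the raw timestamp; drop ts before writing if the table should not receive it.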

jackAKAkarthik — Thu, 12 Jan 2017 16:20:48 GMT

@Raela Wang

Yes, I used this after I posted the question; I forgot to update the thread.

jackAKAkarthik — Tue, 17 Jan 2017 11:20:42 GMT

@Raela Wang

I have used

val newDF = dataframe.withColumn("Timestamp_val", current_timestamp())

to add a new column to an existing DataFrame, but it throws the following error when run on YARN:

java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:221)
        at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)

How else can we add a column? Should we not create a new DataFrame while adding the column?