How to append new column values to a DataFrame based on unique IDs

supriya
New Contributor II

I need to create a new column with data in a DataFrame.

Example:

val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop"),
  (11L, "a b c d e spark"),
  (12L, "b d"),
  (13L, "spark f g h"),
  (14L, "hadoop mapreduce")
)).toDF("id", "text")

val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))
val rdd: RDD[(Long, Double)] = sparkContext.parallelize(tuples)

Each tuple holds an ID and an AVERAGE. I want to add a new column named Average to test, filling in the value for each row by matching on ID, and generate a new DataFrame or RDD.


raela
New Contributor III

Are you trying to add a new column to tuples?

You would first have to convert tuples into a DataFrame, which is easy (with import sqlContext.implicits._ in scope, as it is by default in the shell):

val tuplesDF = tuples.toDF("id", "average")

Then you can use withColumn to create a new column:

tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)

Refer to the DataFrame documentation here:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
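Putting those two steps together, a minimal sketch, assuming a Spark shell or notebook where a SQLContext named sqlContext is in scope (the "average2" column name is just illustrative):

```scala
// toDF on a local collection needs the SQLContext implicits.
import sqlContext.implicits._

val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7),
                  (11L, 0.15), (12L, 6.1), (13L, 1.8))

// Lift the local list into a DataFrame with named columns.
val tuplesDF = tuples.toDF("id", "average")

// withColumn returns a NEW DataFrame with the extra column appended.
val withBonus = tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)
withBonus.show()
```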

supriya
New Contributor II

Thanks @Raela Wang​. But my requirement is different: I want to add the Average column to the test DataFrame based on the id column. I know this is possible using a join, but I think a join would be too slow. If you have another solution, please suggest it.

You have given a method to copy the values of an existing column into a newly created column, but @supriya​ asked a different question.

raela
New Contributor III

@supriya

You will have to do a join.

import org.apache.spark.sql.functions._
val tuplesDF = tuples.toDF("tupleid", "average")
val joined = test.join(tuplesDF, test.col("id") === tuplesDF.col("tupleid"), "inner").select("id", "text", "average")
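Regarding the speed concern: since the id-to-average lookup is tiny, a broadcast hint usually makes this join cheap. A hedged end-to-end sketch (the tupleid/average column names follow the reply above; broadcast is from org.apache.spark.sql.functions):

```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// The small id -> average lookup; rename the key to avoid an ambiguous "id".
val tuplesDF = tuples.toDF("tupleid", "average")

// broadcast() hints Spark to replicate the small side to every executor,
// turning a shuffle join into a cheap map-side join.
val joined = test
  .join(broadcast(tuplesDF), test.col("id") === tuplesDF.col("tupleid"), "inner")
  .select("id", "text", "average")

joined.show()
```

Note that an inner join drops test rows with no matching id (here 5L and 14L); use "left_outer" instead if those rows should be kept with a null average.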

jackAKAkarthik
New Contributor III

@Raela Wang​  how can I add a timestamp to every row in the DataFrame dynamically?

val date = new java.util.Date

val AppendDF = existingDF.withColumn("new_column_name",Column date)

Is not working for me.

Can you help?

@jack AKA karthik​: To add a timestamp to a DataFrame dynamically:

import org.apache.spark.sql.functions._
val AppendDF = customerDF.withColumn("new_column_name",current_timestamp())

I think it will work for you.
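A minimal sketch of that approach, assuming an existing DataFrame (customerDF and the load_ts column name here are just stand-ins):

```scala
import org.apache.spark.sql.functions._

// current_timestamp() is evaluated at query time, so every row in this
// load gets the timestamp without any per-row Scala code.
val appendDF = customerDF.withColumn("load_ts", current_timestamp())

appendDF.printSchema()  // load_ts appears with timestamp type
```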

@supriya​ 

thanks for the help. It worked.

@supriya​ 

how can I cast this current_timestamp() to a string type? My Hive version is older (0.13) and cannot load a timestamp into the table as-is.

@Raela Wang​ 

How can I convert current_timestamp() to a string in Scala? I have tried a few approaches with no luck.

raela
New Contributor III

@jack karthik What have you tried? Have you tried cast()?

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

df.select(df("colA").cast("string"))
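Applied to the timestamp question, a sketch with two options (dataframe and ts_str are illustrative names; date_format is also in org.apache.spark.sql.functions):

```scala
import org.apache.spark.sql.functions._

// Option 1: a plain cast, using Spark's default timestamp formatting.
val castDF = dataframe.withColumn("ts_str", current_timestamp().cast("string"))

// Option 2: date_format, when Hive expects a specific pattern.
val fmtDF = dataframe.withColumn("ts_str",
  date_format(current_timestamp(), "yyyy-MM-dd HH:mm:ss"))
```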

jackAKAkarthik
New Contributor III

@Raela Wang​ 

Yes, I used cast() after I posted the question; I forgot to update.

jackAKAkarthik
New Contributor III

@Raela Wang​ 

I have used

val newDF = dataframe.withColumn("Timestamp_val", current_timestamp())

to add a new column to an existing DataFrame, but it throws errors when running on YARN:

java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:221)
        at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)

How else can we add a column? Should we avoid creating a new DataFrame when adding the column?
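Creating a new DataFrame is the expected pattern: DataFrames are immutable, and withColumn always returns a new one rather than mutating in place. A minimal sketch, assuming dataframe is the existing DataFrame (selecting columns explicitly rather than relying on "*" after the schema change can sidestep some analyzer quirks in older Spark versions, though the exact cause of the error above may differ):

```scala
import org.apache.spark.sql.functions._

// withColumn does not modify `dataframe`; bind the result to a new val.
val stamped = dataframe.withColumn("Timestamp_val", current_timestamp())

// List the columns explicitly instead of using "*" on the widened schema.
val result = stamped.select(stamped.columns.map(col): _*)
```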
