01-22-2016 01:47 AM
I need to create a new column with data in a DataFrame.
Example:
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop"),
  (11L, "a b c d e spark"),
  (12L, "b d"),
  (13L, "spark f g h"),
  (14L, "hadoop mapreduce")
)).toDF("id", "text")

val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))
val rdd: RDD[(Long, Double)] = sparkContext.parallelize(tuples.toSeq)
Each tuple holds an ID and an AVERAGE. I want to add a new column named Average to the test DataFrame, filling in the value for each row by matching on ID, and generate a new DataFrame or RDD.
01-29-2016 09:59 AM
Are you trying to add a new column to tuples?
You would first have to convert tuples into a DataFrame, and this can be easily done:
val tuplesDF = tuples.toDF("id", "average")
Then you can use withColumn to create a new column:
tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)
Refer to the DataFrame documentation here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
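Putting the two steps above together, a minimal sketch (assuming a SparkSession named `spark` with its implicits imported, so `toDF` is available on the list):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7))

// Convert the list of tuples into a DataFrame with named columns
val tuplesDF = tuples.toDF("id", "average")

// withColumn derives a new column from an existing one
val withExtra = tuplesDF.withColumn("average2", col("average") + 10)
withExtra.show()
```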
01-31-2016 05:11 AM
Thanks @Raela Wang. But my requirement is different: I want to add the Average column to the test DataFrame based on the id column. I know this is possible using a join, but I think a join would be too slow. If you have any other solution, please suggest it.
01-09-2017 03:10 AM
You have given a method to copy the values of an existing column into a newly created column, but @supriya asked a different question.
02-02-2016 05:08 PM
@supriya
You will have to do a join.
import org.apache.spark.sql.functions._
val tuplesDF = tuples.toDF("tupleid", "average")
val joined = test.join(tuplesDF, col("id") === col("tupleid"), "inner").select("id", "text", "average")
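If the rows in test without a matching average (ids 5 and 14 in the example) should be kept rather than dropped, a left outer join is an alternative. A self-contained sketch, assuming the test DataFrame and tuples list from the original question:

```scala
import org.apache.spark.sql.functions.col

// Sketch: convert the tuples list to a DataFrame first,
// using a distinct join-key name to avoid an ambiguous "id" column
val tuplesDF = tuples.toDF("tupleid", "average")

// left_outer keeps every row of test; ids with no average get null
val joined = test
  .join(tuplesDF, col("id") === col("tupleid"), "left_outer")
  .drop("tupleid")
```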
01-09-2017 03:12 AM
@Raela Wang how can I add a timestamp to every row in the DataFrame dynamically?
val date = new java.util.Date
val AppendDF = existingDF.withColumn("new_column_name",Column date)
This is not working for me.
Can you help?
01-09-2017 04:35 AM
@jack AKA karthik: To add a timestamp column to a DataFrame dynamically:
import org.apache.spark.sql.functions._
val AppendDF = customerDF.withColumn("new_column_name",current_timestamp())
I think this will work for you.
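For comparison, the java.util.Date attempt above fails because withColumn expects a Column, not a plain JVM value. If a fixed driver-side timestamp is wanted instead of a per-query one, wrapping the value in lit() is one option. A hedged sketch, assuming a DataFrame named `customerDF`:

```scala
import org.apache.spark.sql.functions.lit

// lit() turns the driver-side timestamp into a constant Column;
// every row gets the same value, captured when the column is defined
val ts = new java.sql.Timestamp(System.currentTimeMillis())
val appended = customerDF.withColumn("new_column_name", lit(ts))
```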
01-09-2017 06:22 AM
@supriya
thanks for the help. It worked.
01-12-2017 01:40 AM
@supriya
how can I cast this current_timestamp() to a string type? My Hive version is older (0.13) and cannot load a timestamp into the table as-is.
01-12-2017 01:49 AM
@Raela Wang
How can I convert current_timestamp() to a string in Scala? I have tried a few approaches with no luck.
01-12-2017 07:56 AM
@jack karthik What have you tried? Have you tried cast()?
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
df.select(df("colA").cast("string"))
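Applied to the timestamp question in this thread, a sketch (assuming a DataFrame named `df`; date_format is an alternative when the string layout matters, e.g. for an older Hive version):

```scala
import org.apache.spark.sql.functions.{current_timestamp, date_format}

// Plain cast: timestamp column -> its default string representation
val asString = df.withColumn("ts_string", current_timestamp().cast("string"))

// Explicit formatting, controlling the output pattern
val formatted = df.withColumn("ts_string",
  date_format(current_timestamp(), "yyyy-MM-dd HH:mm:ss"))
```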
01-12-2017 08:20 AM
@Raela Wang
Yes, I used this after I posted the question; I forgot to update.
01-17-2017 03:20 AM
@Raela Wang
I have used
val newDF = dataframe.withColumn("Timestamp_val", current_timestamp())
This added a new column to an existing DataFrame, but the job throws errors when run on YARN:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
How else can we add a column? Should we not create a new DataFrame while adding the column?