01-22-2016 01:47 AM
I need to create a new column with data in a DataFrame.
Example:
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"), (5L, "l m n"), (6L, "mapreduce spark"), (7L, "apache hadoop"),
  (11L, "a b c d e spark"), (12L, "b d"), (13L, "spark f g h"), (14L, "hadoop mapreduce")
)).toDF("id", "text")
val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))
val rdd: RDD[(Long, Double)] = sparkContext.parallelize(tuples)
Each tuple holds an ID and an AVERAGE. Now I want to add a new column named Average to test, filling in the value for each row by matching on ID, and generate a new DataFrame or RDD.
01-29-2016 09:59 AM
Are you trying to add a new column to tuples?
You would first have to convert tuples into a DataFrame, and this can be easily done:
val tuplesDF = tuples.toDF("id", "average")
Then you can use withColumn to create a new column:
tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)
Refer to the DataFrame documentation here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
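For intuition, here is a plain-Scala sketch (no Spark required) of what that withColumn call computes over the sample tuples from the question; the object name is illustrative:

```scala
// Plain-Scala sketch of withColumn("average2", average + 10),
// using the sample tuples from the question; no Spark required.
object WithColumnSketch {
  val tuples: List[(Long, Double)] =
    List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))

  // Each (id, average) row gains an average2 value equal to average + 10.
  val withAverage2: List[(Long, Double, Double)] =
    tuples.map { case (id, avg) => (id, avg, avg + 10) }
}
```

In Spark the same row-wise expression is evaluated lazily over the DataFrame; the original columns are untouched and a new column is appended.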
01-31-2016 05:11 AM
Thanks @Raela Wang. But my requirement is different: I want to add an Average column to the test DataFrame, matched on the id column. I know this is possible using a join, but I think the join would be too slow. If you have any other solution, please suggest it.
01-09-2017 03:10 AM
You have given a method to copy the values of an existing column into a newly created column, but @supriya has asked a different question.
02-02-2016 05:08 PM
@supriya
You will have to do a join. Since tuples is a List, convert it to a DataFrame first, and give its key a distinct name so the join condition is unambiguous:
import org.apache.spark.sql.functions._
val tuplesDF = tuples.toDF("tupleid", "average")
val joined = test.join(tuplesDF, col("id") === col("tupleid"), "inner").select("id", "text", "average")
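The inner join above amounts to a key lookup. As a plain-Scala sketch (no Spark) of what it computes on the sample data from the question:

```scala
// Plain-Scala sketch of the inner join: keep only test rows whose id
// also appears in tuples, attaching the matching average. No Spark needed.
object JoinSketch {
  val test: List[(Long, String)] = List(
    (4L, "spark i j k"), (5L, "l m n"), (6L, "mapreduce spark"), (7L, "apache hadoop"),
    (11L, "a b c d e spark"), (12L, "b d"), (13L, "spark f g h"), (14L, "hadoop mapreduce"))
  val tuples: List[(Long, Double)] =
    List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))

  val avgById: Map[Long, Double] = tuples.toMap

  // Inner join on id: ids 5 and 14 have no average, so those rows drop out.
  val joined: List[(Long, String, Double)] =
    test.collect { case (id, text) if avgById.contains(id) => (id, text, avgById(id)) }
}
```

Note the result has 6 rows, not 8: the inner join drops test rows (ids 5 and 14) that have no matching average.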
01-09-2017 03:12 AM
@Raela Wang how can I add a timestamp to every row in the DataFrame dynamically?
val date = new java.util.Date
val AppendDF = existingDF.withColumn("new_column_name",Column date)
This is not working for me.
Can you help?
01-09-2017 04:35 AM
@jack AKA karthik: To add a timestamp to a DataFrame dynamically:
import org.apache.spark.sql.functions._
val AppendDF = customerDF.withColumn("new_column_name", current_timestamp())
I think this will work for you.
01-09-2017 06:22 AM
@supriya
Thanks for the help. It worked.
01-12-2017 01:40 AM
@supriya
How can I cast this current_timestamp() to a string type? My Hive version is older (0.13) and cannot load a timestamp into the table as-is.
01-12-2017 01:49 AM
@Raela Wang
How can I convert current_timestamp() to a string in Scala? I have tried a few approaches but no luck.
01-12-2017 07:56 AM
@jack karthik What have you tried? Have you tried cast()?
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
df.select(df("colA").cast("string"))
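On the Spark side, cast("string") (or date_format) handles this directly. For the plain-Scala half of the question, an equivalent conversion with java.time, formatted in the "yyyy-MM-dd HH:mm:ss" shape that Hive timestamp columns expect, might look like this (a sketch; the object and method names are illustrative, not from the thread):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Sketch: format a timestamp as the "yyyy-MM-dd HH:mm:ss" string form
// that an older Hive (e.g. 0.13) can load into a STRING column.
object TimestampString {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  def toHiveString(ts: LocalDateTime): String = ts.format(fmt)
}
```

For example, toHiveString on a LocalDateTime of 2017-01-12 07:56:00 yields the string "2017-01-12 07:56:00".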
01-12-2017 08:20 AM
@Raela Wang
Yes, I used that after I posted the question; I forgot to update.
01-17-2017 03:20 AM
@Raela Wang
I have used
val newDF = dataframe.withColumn("Timestamp_val", current_timestamp())
to add a new column to an existing DataFrame, but it throws errors at runtime on YARN:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
How else can we add a column? Should we not create a new DataFrame while adding the column?