01-22-2016 01:47 AM
I need to create a new column with data in a DataFrame.
Example:
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"), (5L, "l m n"), (6L, "mapreduce spark"), (7L, "apache hadoop"),
  (11L, "a b c d e spark"), (12L, "b d"), (13L, "spark f g h"), (14L, "hadoop mapreduce")
)).toDF("id", "text")
val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))
val rdd: RDD[(Long, Double)] = sparkContext.parallelize(tuples)
Each tuple is an (ID, AVERAGE) pair. Now I want to add a new column named Average, filling in the value for each row according to its ID, and generate a new DataFrame or RDD.
01-29-2016 09:59 AM
Are you trying to add a new column to tuples?
You would first have to convert tuples into a DataFrame, and this can be easily done:
val tuplesDF = tuples.toDF("id", "average")
Then you can use withColumn to create a new column:
tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)
Refer to the DataFrame documentation here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
01-31-2016 05:11 AM
Thanks @Raela Wang. But my requirement is different: I want to add the Average column to the test DataFrame by matching on the id column. I know this is possible using a join, but I think the join process would be too slow. If you have any other solution, please suggest it.
01-09-2017 03:10 AM
You have given a method to copy the values of an existing column into a newly created column, but @supriya asked a different question.
02-02-2016 05:08 PM
@supriya
You will have to do a join. The tuples list needs to be converted to a DataFrame first, with a distinct key-column name so the join condition is unambiguous:
import org.apache.spark.sql.functions._
val tuplesDF = tuples.toDF("tupleid", "average")
val joined = test.join(tuplesDF, test("id") === tuplesDF("tupleid"), "inner").select("id", "text", "average")
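For reference, here is a self-contained sketch of this kind of join that can be run locally. It assumes Spark 2.x-style setup with a SparkSession named `spark` (the thread above uses the older sqlContext), and a smaller sample of the data:

```scala
import org.apache.spark.sql.SparkSession

object JoinExample extends App {
  // Assumption: a local SparkSession; the original thread used sqlContext (Spark 1.x).
  val spark = SparkSession.builder().master("local[*]").appName("join-example").getOrCreate()
  import spark.implicits._

  val test = Seq(
    (4L, "spark i j k"), (5L, "l m n"), (6L, "mapreduce spark"), (7L, "apache hadoop")
  ).toDF("id", "text")

  val tuples = List((4L, 3.0), (6L, 0.12), (7L, 0.7))
  // Name the key column differently from "id" so the join condition is unambiguous.
  val tuplesDF = tuples.toDF("tupleid", "average")

  // Inner join on the id, then keep only the columns we need.
  // Rows of `test` with no matching tuple (here id 5) are dropped by the inner join.
  val joined = test
    .join(tuplesDF, test("id") === tuplesDF("tupleid"), "inner")
    .select("id", "text", "average")

  joined.show()
  spark.stop()
}
```

An inner join drops unmatched rows; if every row of test should survive even without an average, a "left_outer" join would keep them with null in the average column.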
01-09-2017 03:12 AM
@Raela Wang How can I add a timestamp to every row in the DataFrame dynamically?
val date = new java.util.Date
val AppendDF = existingDF.withColumn("new_column_name",Column date)
This is not working for me.
Can you help?
01-09-2017 04:35 AM
@jack AKA karthik To add a timestamp to a DataFrame dynamically:
import org.apache.spark.sql.functions._
val AppendDF = customerDF.withColumn("new_column_name",current_timestamp())
I think it will work for you.
01-09-2017 06:22 AM
@supriya
Thanks for the help. It worked.
01-12-2017 01:40 AM
@supriya
How can I cast this current_timestamp() to a string type? My Hive version is older (0.13) and cannot load a timestamp into the table as-is.
01-12-2017 01:49 AM
@Raela Wang
How can I convert current_timestamp() to a string in Scala? I have tried a few approaches but no luck.
01-12-2017 07:56 AM
@jack karthik What have you tried? Have you tried cast()?
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
df.select(df("colA").cast("string"))
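To sketch this end to end: besides cast(), Spark's built-in date_format function can produce the string directly with an explicit pattern, which is useful when the target table expects a specific format. A minimal runnable example, assuming a local SparkSession and a toy DataFrame (names here are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, date_format}

object TimestampStringExample extends App {
  // Assumption: a local SparkSession; adjust for your environment.
  val spark = SparkSession.builder().master("local[*]").appName("ts-example").getOrCreate()
  import spark.implicits._

  val df = Seq((1L, "a"), (2L, "b")).toDF("id", "text")

  // Option 1: cast the timestamp to its default string representation.
  val withCast = df.withColumn("ts_str", current_timestamp().cast("string"))

  // Option 2: format it explicitly with a date/time pattern string.
  val withFormat = df.withColumn("ts_str",
    date_format(current_timestamp(), "yyyy-MM-dd HH:mm:ss"))

  withFormat.show(false)
  spark.stop()
}
```

Either way the ts_str column arrives as a plain StringType, which an older Hive table can load without timestamp support.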
01-12-2017 08:20 AM
@Raela Wang
Yes, I used this after I posted the question; forgot to update.
01-17-2017 03:20 AM
@Raela Wang
I have used
val newDF = dataframe.withColumn("Timestamp_val", current_timestamp())
to add a new column to an existing DataFrame, but it throws errors at runtime under YARN:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
How else can we add a column? Should we not create a new DataFrame while adding the column?
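Creating a new DataFrame is in fact the only way: withColumn never modifies the existing DataFrame, it returns a new one, so the result must be assigned to a new val (and the val name must be a single identifier). A minimal sketch of the correct pattern, assuming a local SparkSession; this does not by itself explain the UnresolvedStar failure, which usually points at how the surrounding query references columns rather than at withColumn:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

object WithColumnExample extends App {
  // Assumption: a local SparkSession; adjust for your cluster.
  val spark = SparkSession.builder().master("local[*]").appName("withcolumn-example").getOrCreate()
  import spark.implicits._

  val dataframe = Seq((1L, "a"), (2L, "b")).toDF("id", "text")

  // `val new DF = ...` is a Scala syntax error; use one identifier for the name.
  // withColumn returns a NEW DataFrame; the original `dataframe` is unchanged.
  val newDF = dataframe.withColumn("Timestamp_val", current_timestamp())

  newDF.printSchema()
  spark.stop()
}
```

Subsequent operations should then use newDF, not the original dataframe, or the added column will appear to be missing.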