01-22-2016 01:47 AM
I need to create a new column with data in a DataFrame.
Example:
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop"),
  (11L, "a b c d e spark"),
  (12L, "b d"),
  (13L, "spark f g h"),
  (14L, "hadoop mapreduce")
)).toDF("id", "text")

val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))
val rdd: RDD[(Long, Double)] = sparkContext.parallelize(tuples.toSeq)
Each tuple holds an ID and an AVERAGE. I want to add a new column named Average to the test DataFrame, filling in the value for each row by matching on ID, and generate a new DataFrame or RDD.
01-29-2016 09:59 AM
Are you trying to add a new column to tuples?
You would first have to convert tuples into a DataFrame, and this can be easily done:
val tuplesDF = tuples.toDF("id", "average")
Then you can use withColumn to create a new column:
tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)
Refer to the DataFrame documentation here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
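Putting the two steps above together, a minimal sketch (assuming a SparkSession named `spark` with its implicits imported, so `toDF` is available on the list):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7))

// Convert the list of tuples into a DataFrame with named columns
val tuplesDF = tuples.toDF("id", "average")

// withColumn derives a new column from an existing one
val withExtra = tuplesDF.withColumn("average2", col("average") + 10)
withExtra.show()
```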
01-31-2016 05:11 AM
Thanks @Raela Wang. But my requirement is different: I want to add the Average column to the test DataFrame based on the id column. I know this is possible using a join, but I think a join would be too slow. If you have any other solution, please suggest it.
01-09-2017 03:10 AM
You have given a method to copy the values of an existing column into a newly created column, but @supriya asked a different question.
02-02-2016 05:08 PM
@supriya
You will have to do a join.
import org.apache.spark.sql.functions._
val tuplesDF = tuples.toDF("tupleid", "average")
val joined = test.join(tuplesDF, col("id") === col("tupleid"), "inner").select("id", "text", "average")
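If the rows in test without a matching average (ids 5 and 14 in the example) should be kept rather than dropped, a left outer join is an alternative. A self-contained sketch, assuming the test DataFrame and tuples list from the original question:

```scala
import org.apache.spark.sql.functions.col

// Sketch: convert the tuples list to a DataFrame first,
// using a distinct join-key name to avoid an ambiguous "id" column
val tuplesDF = tuples.toDF("tupleid", "average")

// left_outer keeps every row of test; ids with no average get null
val joined = test
  .join(tuplesDF, col("id") === col("tupleid"), "left_outer")
  .drop("tupleid")
```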
01-09-2017 03:12 AM
@Raela Wang how can I add a timestamp to every row in the DataFrame dynamically?
val date = new java.util.Date
val AppendDF = existingDF.withColumn("new_column_name",Column date)
This is not working for me.
Can you help?
01-09-2017 04:35 AM
@jack AKA karthik: To add a timestamp column to a DataFrame dynamically:
import org.apache.spark.sql.functions._
val AppendDF = customerDF.withColumn("new_column_name",current_timestamp())
I think this will work for you.
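For comparison, the java.util.Date attempt above fails because withColumn expects a Column, not a plain JVM value. If a fixed driver-side timestamp is wanted instead of a per-query one, wrapping the value in lit() is one option. A hedged sketch, assuming a DataFrame named `customerDF`:

```scala
import org.apache.spark.sql.functions.lit

// lit() turns the driver-side timestamp into a constant Column;
// every row gets the same value, captured when the column is defined
val ts = new java.sql.Timestamp(System.currentTimeMillis())
val appended = customerDF.withColumn("new_column_name", lit(ts))
```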
01-09-2017 06:22 AM
@supriya
thanks for the help. It worked.
01-12-2017 01:40 AM
@supriya
how can I cast this current_timestamp() to a string type? My Hive version is older (0.13) and cannot load a timestamp into the table as-is.
01-12-2017 01:49 AM
@Raela Wang
How can I convert current_timestamp() to a string in Scala? I have tried a few approaches with no luck.
01-12-2017 07:56 AM
@jack karthik What have you tried? Have you tried cast()?
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
df.select(df("colA").cast("string"))
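Applied to the timestamp question in this thread, a sketch (assuming a DataFrame named `df`; date_format is an alternative when the string layout matters, e.g. for an older Hive version):

```scala
import org.apache.spark.sql.functions.{current_timestamp, date_format}

// Plain cast: timestamp column -> its default string representation
val asString = df.withColumn("ts_string", current_timestamp().cast("string"))

// Explicit formatting, controlling the output pattern
val formatted = df.withColumn("ts_string",
  date_format(current_timestamp(), "yyyy-MM-dd HH:mm:ss"))
```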
01-12-2017 08:20 AM
@Raela Wang
Yes, I used this after I posted the question; I forgot to update.
01-17-2017 03:20 AM
@Raela Wang
I have used
val newDF = dataframe.withColumn("Timestamp_val", current_timestamp())
This added a new column to an existing DataFrame, but the job throws errors when run on YARN:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
How else can we add a column? Should we not create a new DataFrame while adding the column?