How to append new column values to a DataFrame based on unique IDs

supriya
New Contributor II

I need to create a new column with data in a DataFrame.

Example:

val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop"),
  (11L, "a b c d e spark"),
  (12L, "b d"),
  (13L, "spark f g h"),
  (14L, "hadoop mapreduce")
)).toDF("id", "text")

val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7), (11L, 0.15), (12L, 6.1), (13L, 1.8))
val rdd: RDD[(Long, Double)] = sparkContext.parallelize(tuples)

Each tuple holds an ID and an AVERAGE. I want to add a new column named Average to test, filling in the value for each row by matching on ID, and generate a new DataFrame or RDD.


raela
New Contributor III

Are you trying to add a new column to tuples?

You would first have to convert tuples into a DataFrame, which is easy (with import sqlContext.implicits._ in scope, as it is by default in the shell):

val tuplesDF = tuples.toDF("id", "average")

Then you can use withColumn to create a new column:

tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)

Refer to the DataFrame documentation here:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
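Putting those two steps together, a minimal sketch, assuming a Spark shell or notebook where a SQLContext named sqlContext is in scope (the "average2" column name is just illustrative):

```scala
// toDF on a local collection needs the SQLContext implicits.
import sqlContext.implicits._

val tuples = List((0L, 0.9), (4L, 3.0), (6L, 0.12), (7L, 0.7),
                  (11L, 0.15), (12L, 6.1), (13L, 1.8))

// Lift the local list into a DataFrame with named columns.
val tuplesDF = tuples.toDF("id", "average")

// withColumn returns a NEW DataFrame with the extra column appended.
val withBonus = tuplesDF.withColumn("average2", tuplesDF.col("average") + 10)
withBonus.show()
```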

supriya
New Contributor II

Thanks @Raela Wang​. But my requirement is different: I want to add the Average column to the test DataFrame based on the id column. I know this is possible using a join, but I think a join would be too slow. If you have another solution, please suggest it.

You have given a method to copy the values of an existing column into a newly created column, but @supriya​ asked a different question.

raela
New Contributor III

@supriya

You will have to do a join.

import org.apache.spark.sql.functions._
val tuplesDF = tuples.toDF("tupleid", "average")
val joined = test.join(tuplesDF, test.col("id") === tuplesDF.col("tupleid"), "inner").select("id", "text", "average")
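Regarding the speed concern: since the id-to-average lookup is tiny, a broadcast hint usually makes this join cheap. A hedged end-to-end sketch (the tupleid/average column names follow the reply above; broadcast is from org.apache.spark.sql.functions):

```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// The small id -> average lookup; rename the key to avoid an ambiguous "id".
val tuplesDF = tuples.toDF("tupleid", "average")

// broadcast() hints Spark to replicate the small side to every executor,
// turning a shuffle join into a cheap map-side join.
val joined = test
  .join(broadcast(tuplesDF), test.col("id") === tuplesDF.col("tupleid"), "inner")
  .select("id", "text", "average")

joined.show()
```

Note that an inner join drops test rows with no matching id (here 5L and 14L); use "left_outer" instead if those rows should be kept with a null average.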

jackAKAkarthik
New Contributor III

@Raela Wang​  how can I add a timestamp to every row in the DataFrame dynamically?

val date = new java.util.Date

val AppendDF = existingDF.withColumn("new_column_name",Column date)

Is not working for me.

Can you help?

@jack AKA karthik​: To add a timestamp to a DataFrame dynamically:

import org.apache.spark.sql.functions._
val AppendDF = customerDF.withColumn("new_column_name",current_timestamp())

I think it will work for you.
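A minimal sketch of that approach, assuming an existing DataFrame (customerDF and the load_ts column name here are just stand-ins):

```scala
import org.apache.spark.sql.functions._

// current_timestamp() is evaluated at query time, so every row in this
// load gets the timestamp without any per-row Scala code.
val appendDF = customerDF.withColumn("load_ts", current_timestamp())

appendDF.printSchema()  // load_ts appears with timestamp type
```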

@supriya​ 

thanks for the help. It worked.

@supriya​ 

how can I cast this current_timestamp() to a string type? My Hive version is older (0.13) and cannot load a timestamp into the table as-is.

@Raela Wang​ 

How can I convert current_timestamp() to a string in Scala? I have tried a few approaches with no luck.

raela
New Contributor III

@jack karthik What have you tried? Have you tried cast()?

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

df.select(df("colA").cast("string"))
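Applied to the timestamp question, a sketch with two options (dataframe and ts_str are illustrative names; date_format is also in org.apache.spark.sql.functions):

```scala
import org.apache.spark.sql.functions._

// Option 1: a plain cast, using Spark's default timestamp formatting.
val castDF = dataframe.withColumn("ts_str", current_timestamp().cast("string"))

// Option 2: date_format, when Hive expects a specific pattern.
val fmtDF = dataframe.withColumn("ts_str",
  date_format(current_timestamp(), "yyyy-MM-dd HH:mm:ss"))
```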

jackAKAkarthik
New Contributor III

@Raela Wang​ 

Yes, I used cast() after I posted the question; I forgot to update.

jackAKAkarthik
New Contributor III

@Raela Wang​ 

I have used

val newDF = dataframe.withColumn("Timestamp_val", current_timestamp())

to add a new column to an existing DataFrame, but it throws errors when running on YARN:

java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:221)
        at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)

How else can we add a column? Should we avoid creating a new DataFrame when adding the column?
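Creating a new DataFrame is the expected pattern: DataFrames are immutable, and withColumn always returns a new one rather than mutating in place. A minimal sketch, assuming dataframe is the existing DataFrame (selecting columns explicitly rather than relying on "*" after the schema change can sidestep some analyzer quirks in older Spark versions, though the exact cause of the error above may differ):

```scala
import org.apache.spark.sql.functions._

// withColumn does not modify `dataframe`; bind the result to a new val.
val stamped = dataframe.withColumn("Timestamp_val", current_timestamp())

// List the columns explicitly instead of using "*" on the widened schema.
val result = stamped.select(stamped.columns.map(col): _*)
```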
