Is there a better method to join two dataframes an...

kruhly · ‎05-12-2015

I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straight forward because the real data may have many columns or the column names may not be known. A simple example below

llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)] ddf = sqlContext.createDataFrame(llist,['name','date','duration']) print ddf.collect() up_ddf = sqlContext.createDataFrame([('alice', 100),('bob', 23)],['name','upload'])

this keeps both 'name' columns when we only want a one!

df = ddf.join(up_ddf, ddf.name == up_ddf.name) print ddf.collect() display( ddf.select(ddf.name, (ddf.duration/ddf.upload).alias('duration_per_upload')) )

Executing display above causes an ambiguous name error:

org.apache.spark.sql.AnalysisException: Reference 'name' is ambiguous, could be: name#8484, name#8487.

The error can be avoided by using up_ddf.name from the right-hand dataframe in the join

ddf.select(up_ddf.name, ...

but seems awkward. Is there a better method to join two dataframes and get only one 'name' column?

Is there a better method to join two dataframes and not have a duplicated column?