Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-12-2015 03:29 AM
I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straight forward because the real data may have many columns or the column names may not be known. A simple example below
llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)] ddf = sqlContext.createDataFrame(llist,['name','date','duration']) print ddf.collect() up_ddf = sqlContext.createDataFrame([('alice', 100),('bob', 23)],['name','upload']) this keeps both 'name' columns when we only want a one!df = ddf.join(up_ddf, ddf.name == up_ddf.name) print ddf.collect() display( ddf.select(ddf.name, (ddf.duration/ddf.upload).alias('duration_per_upload')) )
Executing display above causes an ambiguous name error:
org.apache.spark.sql.AnalysisException: Reference 'name' is ambiguous, could be: name#8484, name#8487.
The error can be avoided by using up_ddf.name from the right-hand dataframe in the join
ddf.select(up_ddf.name, ...
but seems awkward. Is there a better method to join two dataframes and get only one 'name' column?