PySpark DataFrame: Select all but one or a set of columns
02-21-2016 10:27 PM
In some SQL implementations you can write select -col_A to select all columns except col_A.
I tried the same thing in Spark 1.6.0 as follows:
For a DataFrame df with three columns col_A, col_B, col_C:
df.select('col_B', 'col_C')  # works
df.select(-'col_A')  # does not work
df.select(*-'col_A')  # does not work
Note: I am trying to find an alternative to df.context.sql("select col_B, col_C ... ") in the script above.
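For reference, a minimal sketch of the explicit-SQL route mentioned above, written against the current SparkSession API rather than the Spark 1.6 SQLContext; the DataFrame contents and the view name "t" are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("select-all-but-one").getOrCreate()

# Hypothetical DataFrame with the three columns from the question.
df = spark.createDataFrame(
    [(1, "a", True), (2, "b", False)],
    ["col_A", "col_B", "col_C"],
)

# Register a temporary view so the wanted columns can be listed explicitly in SQL.
df.createOrReplaceTempView("t")
spark.sql("SELECT col_B, col_C FROM t").show()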
02-24-2016 10:08 PM
I don't think it is supported, since it is not part of the SQL standard.
12-19-2017 03:57 AM
cols = list(set(df.columns) - {'col_A'})
df.select(cols)
@Sohel Khan, @zjffdu
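A self-contained sketch of this set-difference approach, assuming a throwaway DataFrame with the question's column names; note that set() does not guarantee the remaining columns keep their original order:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame matching the question's schema.
df = spark.createDataFrame([(1, "a", True)], ["col_A", "col_B", "col_C"])

# Remove col_A by set difference; select() accepts a list of column names.
cols = list(set(df.columns) - {"col_A"})
df.select(cols).show()  # column order is not guaranteed to match df.columns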
03-25-2020 04:21 PM
@sk777, @zjffdu, @Lejla Metohajrova
If your columns are time-series ordered, or you want to maintain their original order, use:
cols = [c for c in df.columns if c != 'col_A']
df[cols]
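A runnable sketch of this order-preserving version, again on a made-up DataFrame; DataFrame.drop('col_A') is a built-in alternative that also keeps the remaining columns in place:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with the question's columns, in a fixed order.
df = spark.createDataFrame([(1, "a", True)], ["col_A", "col_B", "col_C"])

# The list comprehension preserves the original column order.
cols = [c for c in df.columns if c != "col_A"]
df.select(cols).show()  # col_B, col_C in their original order

# Built-in alternative that achieves the same result.
df.drop("col_A").show()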

