<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: PySpark DataFrame: Select all but one or a set of columns in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29845#M21546</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;@sk777, @zjffdu, @Lejla Metohajrova&lt;/P&gt;
&lt;P&gt;If your columns are time-series ordered, or you want to preserve their original order, use:&lt;/P&gt;
&lt;P&gt;cols = [c for c in df.columns if c != 'col_A']&lt;/P&gt;
&lt;P&gt;df[cols]&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 25 Mar 2020 23:21:12 GMT</pubDate>
    <dc:creator>NavitaJain</dc:creator>
    <dc:date>2020-03-25T23:21:12Z</dc:date>
    <item>
      <title>PySpark DataFrame: Select all but one or a set of columns</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29842#M21543</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;In some SQL implementations, SELECT supports a negative column list, e.g. select -col_A to return all columns except col_A.&lt;/P&gt;
&lt;P&gt;I tried the equivalent in Spark 1.6.0 as follows:&lt;/P&gt;
&lt;P&gt;For a DataFrame df with three columns col_A, col_B, col_C:&lt;/P&gt;
&lt;P&gt;df.select('col_B', 'col_C') # works&lt;/P&gt;
&lt;P&gt;df.select(-'col_A') # does not work&lt;/P&gt;
&lt;P&gt;df.select(*-'col_A') # does not work&lt;/P&gt;
&lt;P&gt;Note: I am looking for a DataFrame-API alternative to df.context.sql("select col_B, col_C ... ") in the script above.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Feb 2016 06:27:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29842#M21543</guid>
      <dc:creator>SohelKhan</dc:creator>
      <dc:date>2016-02-22T06:27:36Z</dc:date>
    </item>
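The "all but a set of columns" case from the question title can be sketched with plain column-name logic; a minimal sketch using Python lists as a stand-in for df.columns, so it runs without a Spark session (the df.select/df.drop calls in the comment are the standard PySpark entry points, with multi-column drop available only on Spark 2.x and later):

```python
# Build the complement of a set of columns, as the question title asks.
# Plain Python lists stand in for df.columns, so no Spark cluster is needed.
columns = ["col_A", "col_B", "col_C", "col_D"]  # stand-in for df.columns
exclude = {"col_A", "col_C"}                    # columns to leave out

cols = [c for c in columns if c not in exclude]
print(cols)  # ['col_B', 'col_D']
# In PySpark this feeds df.select(*cols); on Spark 2.x+, df.drop(*exclude)
# is an equivalent one-liner.
```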
    <item>
      <title>Re: PySpark DataFrame: Select all but one or a set of columns</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29843#M21544</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I don't think it is supported, since it is not part of the SQL standard.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 25 Feb 2016 06:08:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29843#M21544</guid>
      <dc:creator>zjffdu</dc:creator>
      <dc:date>2016-02-25T06:08:43Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark DataFrame: Select all but one or a set of columns</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29844#M21545</link>
      <description>&lt;P&gt;cols = list(set(df.columns) - {'col_A'})&lt;/P&gt;&lt;P&gt;df.select(cols)&lt;/P&gt;&lt;P&gt; @Sohel Khan​&amp;nbsp;, @zjffdu​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2017 11:57:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29844#M21545</guid>
      <dc:creator>LejlaMetohajrov</dc:creator>
      <dc:date>2017-12-19T11:57:00Z</dc:date>
    </item>
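The set-difference recipe above can be checked on plain column-name lists (no Spark session needed); note the caveat the next reply raises, that set() discards the original column order:

```python
# Set difference over column names, as in the reply above.
columns = ["col_A", "col_B", "col_C"]  # stand-in for df.columns

cols = list(set(columns) - {"col_A"})
# The right columns survive, but set() gives no ordering guarantee,
# so compare as sets (or sort) rather than relying on position:
assert set(cols) == {"col_B", "col_C"}
```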
    <item>
      <title>Re: PySpark DataFrame: Select all but one or a set of columns</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29845#M21546</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;@sk777, @zjffdu, @Lejla Metohajrova&lt;/P&gt;
&lt;P&gt;If your columns are time-series ordered, or you want to preserve their original order, use:&lt;/P&gt;
&lt;P&gt;cols = [c for c in df.columns if c != 'col_A']&lt;/P&gt;
&lt;P&gt;df[cols]&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 25 Mar 2020 23:21:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-select-all-but-one-or-a-set-of-columns/m-p/29845#M21546</guid>
      <dc:creator>NavitaJain</dc:creator>
      <dc:date>2020-03-25T23:21:12Z</dc:date>
    </item>
  </channel>
</rss>

