Good recommendation. I was able to do something similar that appears to work.
from pyspark.sql import functions as f

# Step 2. Flag outlier sessions based on duration_minutes (IQR method)
# Each selectExpr below returns a single-row DataFrame; crossJoin attaches that row to every session row
lc = session_agg_df.selectExpr("percentile(duration_minutes, 0.25) AS lower_quartile")
session_agg_df = session_agg_df.crossJoin(lc)
uc = session_agg_df.selectExpr("percentile(duration_minutes, 0.75) AS upper_quartile")
session_agg_df = session_agg_df.crossJoin(uc)
# Interquartile range and the 1.5 * IQR limits on either side of the quartiles
session_agg_df = session_agg_df.withColumn('iqr', session_agg_df['upper_quartile'] - session_agg_df['lower_quartile'])
session_agg_df = session_agg_df.withColumn('lower_limit', session_agg_df['lower_quartile'] - (1.5 * session_agg_df['iqr']))
session_agg_df = session_agg_df.withColumn('upper_limit', session_agg_df['upper_quartile'] + (1.5 * session_agg_df['iqr']))
# is_outlier is 1 when the duration falls below lower_limit or above upper_limit, otherwise 0
session_agg_df = session_agg_df.withColumn('is_outlier', f.when(
    (session_agg_df['duration_minutes'] < session_agg_df['lower_limit']) |
    (session_agg_df['duration_minutes'] > session_agg_df['upper_limit']), 1).otherwise(0))
I am sure there are more efficient ways of doing this, but the above does appear to flag the outliers (based on the IQR method) on my data. Posting in case anyone else gets stuck on this.
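
For anyone who wants to sanity-check the flags on a small sample outside Spark, here is a minimal plain-Python sketch of the same IQR rule. The function name `flag_iqr_outliers`, the parameter `k`, and the sample durations are made up for illustration. It assumes linear interpolation between data points for the quartiles (`method="inclusive"` in the standard library), which I believe is how Spark's exact `percentile` behaves, but treat that as an assumption and compare against your own output.

```python
from statistics import quantiles


def flag_iqr_outliers(durations, k=1.5):
    """Return (lower_limit, upper_limit, flags) for the IQR rule.

    Hypothetical helper for illustration only; needs at least two values.
    """
    # method="inclusive" interpolates linearly between the observed values
    q1, _, q3 = quantiles(durations, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # 1 marks a value outside [lower_limit, upper_limit], 0 marks a value inside
    flags = [1 if (d < lower_limit or d > upper_limit) else 0 for d in durations]
    return lower_limit, upper_limit, flags


# Made-up session durations in minutes (assumption, not real data)
sample = [5, 7, 8, 9, 10, 11, 12, 95]
lower_limit, upper_limit, flags = flag_iqr_outliers(sample)
print(lower_limit, upper_limit, flags)
```

On this sample only the 95-minute session lands outside the limits, so it is the only one flagged.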
Miles