Re: matching sas proc survey means for quantiles i...

Ashwin_DSA · 03-19-2026

I didn’t expect to jump into this thread given my complete lack of SAS knowledge and the rather serious-looking statistics you’re working with. But I decided to treat your post as a chance to see what Genie Code could do with it... and it actually did a surprisingly good job.

Genie walked through your code, explained what it was doing, suggested a few improvements, and even pointed out some likely root causes for why you’re seeing differences between the SAS and Databricks versions.

Genie did produce a modified version of your code that it could run, but I’m not going to paste that here. I have a personal rule... I only share code snippets once I’ve tested them myself and, more importantly, fully understand what they do end-to-end. I’m not quite there yet on the statistical side, so I don’t want to give you something that I can’t confidently stand behind.

What I can share, though, are some of the insights Genie surfaced about why your results might be diverging. Hopefully, that gives you a bit more direction as you debug and refine things. Here is a summary.

The root cause of the SE discrepancy is that your code uses direct Taylor linearization, which divides by a kernel density estimate f_hat at the quantile point. SAS PROC SURVEYMEANS uses the Woodruff (1952) method instead, which avoids density estimation entirely. That difference is the likely reason your SEs come out systematically lower than the SAS values.

Woodruff method (what SAS does):

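1. Estimate Var(F̂(q̂)), the variance of the estimated CDF at the quantile estimate q̂ (this is the variance of a proportion, so no density is needed), and take SE_F as its square root.
2. Build a CI for the proportion around the target probability level p (e.g. p = 0.5 for the median): (p − z·SE_F, p + z·SE_F)
3. Invert both endpoints through the empirical CDF to get the quantile CI (q_lower, q_upper)
4. Derive SE = (q_upper − q_lower) / (2z)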
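
To make those four steps concrete, here is a minimal NumPy sketch. It is not the code Genie produced and not what SAS runs internally; it assumes a single stratum, treats every observation as its own PSU (with-replacement variance), uses a normal critical value z, and inverts the weighted CDF without interpolation. The function names `weighted_quantile` and `woodruff_se` are made up for illustration, and `p` is the target probability level (0.5 for the median). A real survey design with strata and clusters would need the variance in step 1 computed from PSU totals within strata.

```python
import numpy as np

def weighted_quantile(y, w, p):
    """Smallest y whose weighted empirical CDF is >= p (no interpolation)."""
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], w[order]
    cdf = np.cumsum(w_sorted) / w_sorted.sum()
    idx = np.searchsorted(cdf, p, side="left")
    return y_sorted[min(idx, len(y_sorted) - 1)]

def woodruff_se(y, w, p, z=1.96):
    """Woodruff-style SE for a weighted quantile (simplified design assumptions)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    q_hat = weighted_quantile(y, w, p)

    # Step 1: variance of the estimated CDF at q_hat, via the linearized
    # variable of the ratio sum(w * I(y <= q_hat)) / sum(w).
    ind = (y <= q_hat).astype(float)
    cdf_at_q = np.sum(w * ind) / w.sum()
    lin = w * (ind - cdf_at_q) / w.sum()
    se_f = np.sqrt(n / (n - 1) * np.sum(lin ** 2))

    # Step 2: CI on the probability scale, clipped to [0, 1].
    p_lo = max(p - z * se_f, 0.0)
    p_hi = min(p + z * se_f, 1.0)

    # Step 3: invert both endpoints through the weighted empirical CDF.
    q_lo = weighted_quantile(y, w, p_lo)
    q_hi = weighted_quantile(y, w, p_hi)

    # Step 4: back out the SE from the width of the quantile CI.
    se_q = (q_hi - q_lo) / (2 * z)
    return q_hat, se_q, (q_lo, q_hi)

# Toy usage: equal weights, values 1..100, median.
q_hat, se_q, (q_lo, q_hi) = woodruff_se(np.arange(1, 101), np.ones(100), 0.5)
print(q_hat, se_q, q_lo, q_hi)
```

Note that no density estimate appears anywhere: the only variance computed is that of a proportion, and the quantile scale enters only through the CDF inversion in step 3.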

Genie also shared some additional insight after applying the fix.

Your original code estimated quantile SE via direct Taylor linearization, dividing by a kernel density estimate f_hat at the quantile point (weighted_density_finite_diff). Density estimation at a single point is notoriously unstable, and because the SE is inversely proportional to f̂, any overestimate of f̂ shrinks the SE by the same factor. That's why you saw ~148 instead of ~298.
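
To illustrate that sensitivity, here is a small hedged sketch. It is not your weighted_density_finite_diff function; `density_finite_diff` and `linearized_se` are hypothetical stand-ins that assume a central finite difference of the weighted CDF with half-width `h`. The SE of the CDF is held fixed at an assumed value (0.05) so that only the density estimate changes.

```python
import numpy as np

def weighted_cdf(y, w, x):
    """Weighted empirical CDF evaluated at x."""
    return np.sum(w * (y <= x)) / np.sum(w)

def density_finite_diff(y, w, q, h):
    """Central finite-difference density estimate at q (hypothetical helper)."""
    return (weighted_cdf(y, w, q + h) - weighted_cdf(y, w, q - h)) / (2.0 * h)

def linearized_se(se_cdf, density):
    """Direct linearization: SE(q) is approximately SE(F(q)) / f(q)."""
    return se_cdf / density

# Toy data: values 1..100 plus 20 extra observations heaped at 50,
# so the density estimate at 50 depends heavily on the half-width h.
y = np.concatenate([np.arange(1.0, 101.0), np.full(20, 50.0)])
w = np.ones_like(y)
se_cdf = 0.05  # assumed SE of the CDF at the quantile, held fixed

for h in (0.5, 5.0, 20.0):
    f = density_finite_diff(y, w, 50.0, h)
    print(f"h={h:5.1f}  f_hat={f:.4f}  SE={linearized_se(se_cdf, f):.3f}")
```

On this toy data the same CDF standard error turns into very different quantile SEs depending only on the half-width, which is the kind of instability the Woodruff approach sidesteps.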

And since you’ve clearly got an interesting problem here, I’d really encourage you to try Genie Code directly if you have it enabled in your workspace. Just paste in the same code and question you shared here and see what it comes back with. It’s surprisingly good at unpacking long, complex snippets like yours and explaining what’s going on (and occasionally making us humans look bad in the process).

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***