Stata has built-in commands -pctile- and -xtile- for calculating the
quantiles and quantile ranks of a variable. For instance:
xtile ptile = x, nq(100)
assigns to ptile the percentile rank of each observation of the variable x.
For 100 million observations, this took 31 minutes. A faster way is:
sort x
gen ptile = int(100*(_n-1)/_N)+1
which took only 6 minutes, assuming you do not need to restore the
original sort order. The plus and minus one move the ptiles from 0-99 to
1-100, matching the -xtile- command, but are otherwise superfluous.
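If the original order does matter, one common idiom is to record it before
sorting and sort back afterwards, at the cost of a second sort. The sketch
below does that on simulated data and counts the observations on which the
DIY result and -xtile- disagree. The names obsno, p_xtile and p_diy, the
seed and the sample size are arbitrary choices for illustration:
clear
set obs 1000
set seed 12345
gen x = runiform()
* remember the current order so it can be restored later
gen long obsno = _n
xtile p_xtile = x, nq(100)
sort x
gen p_diy = int(100*(_n-1)/_N)+1
* put the data back in its original order
sort obsno
drop obsno
* number of observations on which the two methods disagree
count if p_xtile != p_diy
With no tied values of x the two should agree here; ties are where
differences can appear.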
There is a potential problem with this code - equal values may be assigned
to different quantiles. That can be fixed with one line, at the expense of
increasing the variation in the size of supposedly equal quantiles:
replace ptile = ptile[_n-1] if x==x[_n-1]
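A quick way to confirm that tied values end up together is to assert that
ptile is constant within each distinct value of x. This is only a sketch on
simulated data with deliberately heavy ties; the seed, the sample size and
the choice of 20 distinct values are assumptions made for illustration:
clear
set obs 1000
set seed 12345
* only 20 distinct values, so most observations are tied with others
gen x = ceil(20*runiform())
sort x
gen ptile = int(100*(_n-1)/_N)+1
* replace works through the data in order, so a run of ties inherits
* the ptile of the first observation in the run
replace ptile = ptile[_n-1] if x==x[_n-1]
* stops with an error if any value of x spans more than one ptile
bysort x (ptile): assert ptile[1]==ptile[_N]
* shows how uneven the category sizes have become
tabulate ptile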
There is a drop-in replacement for -xtile- on SSC called -fastxtile- that
is even faster than the DIY method. However, like -xtile-, it is not byable.
The DIY method extends easily to by variables:
sort byvar x
by byvar: gen ptile = int(100*(_n-1)/_N)+1
This takes advantage of the fact that, under a by prefix, _n and _N refer to
position within the current by group. -egen- helps us generalize to by
variables and weights at the same time:
sort byvar x
by byvar: egen sumwgt = sum(wgt)
by byvar: gen rsum = sum(wgt)
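by byvar: gen ptile = int(100*(rsum-wgt)/sumwgt)+1
Subtracting wgt and adding one play the same role as the minus and plus one
in the unweighted formula, keeping the result in the range 1-100. Notice the
two different meanings of the -sum()- function. It is a running sum in
-generate- commands and a completed sum in -egen- commands (where current
Stata documents it under the name -total()-). In this case the -egen-
command added only a minute to the total time.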
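As a sanity check, with every weight equal to one the weighted calculation
ought to reduce to the unweighted one. The sketch below tests that on
simulated data with four by groups. It takes the running sum net of the
current weight, rsum-wgt, so that unit weights reproduce the unweighted
formula exactly, and it stores the sums as doubles, since Stata's default
float type cannot hold large totals exactly. The names p_unw and p_wgt, the
seed and the group layout are assumptions made for illustration:
clear
set obs 1000
set seed 12345
gen byvar = mod(_n, 4)
gen x = runiform()
* unit weights, so both methods should give the same answer
gen wgt = 1
sort byvar x
by byvar: gen p_unw = int(100*(_n-1)/_N)+1
* total() is the documented name for the completed sum in -egen-
by byvar: egen double sumwgt = total(wgt)
by byvar: gen double rsum = sum(wgt)
by byvar: gen p_wgt = int(100*(rsum-wgt)/sumwgt)+1
* stops with an error if the two versions ever differ
assert p_unw == p_wgt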
The speed of -xtile- and relatives is highly dependent on the number of
categories requested - the more categories the less efficient compared to DIY. With
only two categories, the techniques are roughly equal.
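To see how the comparison behaves on your own machine, the -timer- commands
can be wrapped around each approach. A minimal sketch, using an arbitrary
simulated sample of one million observations and illustrative variable
names; the timings it reports will differ from those quoted above:
clear
set obs 1000000
set seed 12345
gen x = runiform()
timer clear
* timers 1, 2 and 3: -xtile- with 2, 10 and 100 categories
local i = 0
foreach nq in 2 10 100 {
    local ++i
    timer on `i'
    xtile p_x`nq' = x, nq(`nq')
    timer off `i'
}
* timer 10: the DIY method with 100 categories, sort included
timer on 10
sort x
gen p_diy = int(100*(_n-1)/_N)+1
timer off 10
* elapsed seconds for each timer
timer list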
SSC also contains the -stile- option to -egen-, which, like the -fastxtile-
command mentioned above, may be worth looking into.
Last changed 17 September 2016 - drf