Regressions on many subsets

Non-overlapping subsets

If a dataset is sorted by a by variable, and you wish to run a separate and independent regression for each value of the by variable, you want a "grouped regression". The -statsby- prefix in Stata supplies one. Michael Droste has his -regressby- for fast execution of grouped regressions, collecting the results in a .dta file.

Overlapping subsets

Neither -statsby- nor -regressby- does a rolling regression, where a separate regression is run for each observation, based on a fixed number of prior observations. You might try something like:

    generate smpl = 0
    * start at 10 so that every window holds a full 10 observations
    forvalues i = 10/`=_N' {
        replace smpl = 0
        replace smpl = 1 if inrange(_n, `i'-9, `i')
        regress y x if smpl
    }

This does about 60 regressions/second from a dataset with 100,000 observations.

There is a -rolling- command that does rolling regressions in one line. For example, the following -rolling- call will run a separate regression of y on x for each observation in the dataset and replace the data in memory with the estimated coefficients. The data for each regression will include that observation and the previous 9:

    tsset n
    rolling _b[_cons] _b[x], window(10) clear: regress y x

With Stata/MP on 8 cores it runs about 40 regressions per second with 1 independent variable. While running on our system it kept 6 or 7 cores busy for the entire run.

Note that the time per regression is nearly proportional to the number of observations in the dataset, so the typical rolling regression over (say) postwar quarterly data would be much faster. This page uses a rolling regression only because it is a familiar example. The typical user concerned about speed would likely have a different procedure in mind.

An alternative command from SSC is -rangerun-, which is more flexible than -rolling- (it gives you more freedom to specify the exact nature of each subset) and somewhat faster. The command sequence:

    program myprog
        quietly {
            regress y x
            gen b_cons = _b[_cons]
            gen b_x = _b[x]
        }
    end
    rangerun myprog, interval(n -9 0)

is the near equivalent of the -rolling- command above, but runs 4 times as fast, even though it uses only one CPU. The difference is that it reports limited results for the first 9 observations, while -rolling- only reports results where it has a full window available.

Nevertheless, the primary advantage of -rangerun- is the ability to specify subsets other than a rolling window.

All of these times are 100 to 1,000 times longer than a single regression on the full dataset. Most of the additional time is spent selecting the observations to be included in each regression. In a general purpose language that selection would not require examining all 100,000 observations, and a great deal of time could be saved in the creation of x'x for each regression: consecutive windows share 9 of their 10 observations, so 90% of the work of creating each x'x is repeated for the next regression.
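As a rough illustration of that point, here is a minimal Python sketch (using NumPy) of a rolling regression of y on a single x that never rescans the data. It keeps running sums of x, y, x*x and x*y, so the cross products for each 10-observation window come from two lookups rather than a fresh pass. The function name rolling_ols and its arguments are hypothetical, chosen for this example. The single-regressor closed form is an assumption made to keep the sketch short, and differencing cumulative sums can lose some precision on very long or badly scaled series.

```python
import numpy as np

def rolling_ols(x, y, window=10):
    """Intercept and slope of y on x for every full trailing window.

    Element k of each result covers observations k .. k+window-1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def window_sum(v):
        # Prefix sums: the sum over any window is a difference of two entries,
        # so the work shared by overlapping windows is done only once.
        c = np.concatenate(([0.0], np.cumsum(v)))
        return c[window:] - c[:-window]

    sx, sy = window_sum(x), window_sum(y)
    sxx, sxy = window_sum(x * x), window_sum(x * y)

    # Closed-form OLS for one regressor plus a constant.
    slope = (window * sxy - sx * sy) / (window * sxx - sx * sx)
    intercept = (sy - slope * sx) / window
    return intercept, slope

# Example with simulated data (values are illustrative only).
rng = np.random.default_rng(12345)
x_demo = rng.normal(size=100_000)
y_demo = 1.0 + 2.0 * x_demo + rng.normal(size=100_000)
b_cons, b_x = rolling_ols(x_demo, y_demo, window=10)
```

Like -rolling-, this sketch returns results only where a full window is available, so a series of N observations yields N-9 pairs of coefficients.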


last modified 14 November 2019 by feenberg@nber.org