Stata -collapse- is slow

Collapse

-collapse- is not very fast. The author no doubt surmised that even if it would be used with large datasets, it wouldn't be inside a loop. But sometimes it is, and it can become the rate-limiting step in a seriously long-running program. It is easily replaced with faster code, but the total benefit isn't as great as one would hope.

Suppose we have monthly income for 5 million individuals, and wish to aggregate to annual income. We could write:

    collapse (sum) month_inc, by(personid)

and find that the -collapse- command takes about .45 seconds/million observations. -collapse- is multi-threaded, and three cores were in use for a brief period. Alternatively, we could do the work "by hand":

    by personid: gen annual_inc = sum(month_inc)
    by personid: keep if _n==_N

Of course, this assumes the data are already sorted, but that is commonly the case. The -sum- function in a -generate- statement does the cumulative sum, starting over at zero for each new by group, so the last observation in each group holds the group total. (The -egen- statement is different: its total() function puts the group total on every observation.) This takes about .016 seconds/million for the first line, which would seem like a big win, but the -keep- statement takes another .22 seconds/million, so the overall speed is only about twice as fast. -keep- may be doing a lot of data movement. If -save- ever allowed an -if- qualifier, there could be a substantial savings.
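To confirm that the two approaches give the same answer, and to time them on your own machine, a sketch along the following lines may help. The simulated data (20,000 persons by 12 months, integer-valued incomes) and the names collapse_inc and viacollapse are made up for illustration; timings will differ with hardware, Stata flavor and dataset size.

```stata
clear all
set seed 12345

* hypothetical test data: 20,000 persons x 12 months each
set obs 20000
gen long personid = _n
expand 12
sort personid
gen double month_inc = floor(5000*runiform())

* method 1: -collapse-, result saved to a temporary file
preserve
timer clear
timer on 1
collapse (sum) annual_inc=month_inc, by(personid)
timer off 1
rename annual_inc collapse_inc
tempfile viacollapse
save `viacollapse'
restore

* method 2: running sum, then keep the last observation of each by group
timer on 2
by personid: gen double annual_inc = sum(month_inc)
by personid: keep if _n==_N
timer off 2

* the two results should agree for every person
merge 1:1 personid using `viacollapse', assert(match) nogenerate
assert annual_inc == collapse_inc

* timer 1 is -collapse-, timer 2 is the by-hand version
timer list
```

Increase the -set obs- count to approach the 5 million individuals discussed above. The incomes are integer-valued so that the exact equality in the final -assert- is safe.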

Maximum, minimum, etc. are easily done:

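    by personid: gen max_inc = month_inc
    by personid: replace max_inc = max(max_inc, max_inc[_n-1])
    by personid: gen min_inc = month_inc
    by personid: replace min_inc = min(min_inc, min_inc[_n-1])

A -generate- statement cannot refer to the variable it is creating, so each running value is built with -generate- followed by -replace-. As with the running sum, the last observation in each by group then holds the group's maximum and minimum. Quartiles, means, and standard deviations require just a bit more code.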
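As an illustration of the "bit more code" for means and standard deviations, here is a minimal sketch using running sums. The toy data and the variable names s1, s2, n, mean_inc and sd_inc are invented for the example. It uses the one-pass sum-of-squares formula, which is simple but can lose precision when the mean is large relative to the spread.

```stata
clear
input long personid double month_inc
1 10
1 20
1 30
2 5
2 15
end
sort personid

* running sum and running sum of squares within each person
by personid: gen double s1 = sum(month_inc)
by personid: gen double s2 = sum(month_inc^2)
* running count of nonmissing values
by personid: gen long n = sum(!missing(month_inc))

* the last observation in each group carries the group totals
by personid: keep if _n==_N

gen double mean_inc = s1/n
* sample standard deviation (n-1 denominator)
gen double sd_inc = sqrt((s2 - n*mean_inc^2)/(n-1))
list personid mean_inc sd_inc
```

For person 1 this gives a mean of 20 and a standard deviation of 10. Groups with a single observation get a missing standard deviation, since n-1 is zero.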

Sergio Correia's -fcollapse- command in the -ftools- SSC package apparently achieves a similar degree of improvement, and keeps the -collapse- syntax.

Mauricio Caceres Bravo has written a C-language (partial) replacement for -collapse-, called -gcollapse-, which is part of the -gtools- package. It may be much faster than -ftools-, and so faster still than Stata's own -collapse-.
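Both packages can be tried with very little change to existing code. The sketch below assumes internet access for the one-time SSC installation and uses a small made-up dataset. It has not been timed here, and depending on the platform either package may need extra setup (a Mata library for -ftools-, a compiled plugin for -gtools-).

```stata
* one-time installation from SSC (needs internet access)
ssc install ftools, replace
ssc install gtools, replace

* toy data, invented for the example
clear
input long personid double month_inc
1 10
1 20
2 5
2 15
end

* -fcollapse- from ftools, same syntax as -collapse-
preserve
fcollapse (sum) annual_inc=month_inc, by(personid)
list
restore

* -gcollapse- from gtools, same syntax again
gcollapse (sum) annual_inc=month_inc, by(personid)
list
```

Each call should leave one observation per person, with annual_inc equal to 30 for person 1 and 20 for person 2.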

Last update 28 October 2019 by drf