Stata for very large datasets

The analysis of very large files, such as health insurance claims, has long been considered the preserve of SAS, because SAS could handle datasets of any size, while Stata was limited to datasets that would fit in core. In many cases a preliminary extraction has been done in SAS, followed by analysis of a smaller subset in Stata. In this note we offer suggestions for doing the extraction in Stata, eliminating the SAS step. This is followed by some suggestions for greatly reducing the run time for common operations in Stata.

It is a truism that computers are cheap and people are expensive. However, people waiting for computers are also expensive, and often a little thought put into programming can pay dividends in faster results, especially when programs are run repeatedly on datasets with tens or hundreds of millions of observations and take days or weeks to complete.

Links to other sources

Daniel Feenberg
feenberg@nber.org
last update 28 October 2019 by drf