Stata I/O with very large files

While Stata/MP can make estimation very fast, Amdahl's law means the single-threaded I/O commands become an increasingly large share of runtime. In our work datasets are often tens of gigabytes, and sometimes hundreds of gigabytes when multiple years of Medicare data are combined.


First, the good news: the -use- statement takes variable lists and -if- qualifiers, which can dramatically speed up input if only a fraction of the data is needed, and which also reduce memory usage. In typical applications here, varlists and -if- qualifiers provide an order-of-magnitude improvement in speed.
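For instance, reading only two variables and only the AMI rows from one of the annual Medicare files discussed below might look like this (the filename and variable names are illustrative):

```stata
use id diagnosis if diagnosis=="ami" using med2009, clear
```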


The -append- statement doesn't take an -if- qualifier, though it does take a varlist (via its -keep()- option). If the appended files are too big for available memory you can't use:

    forvalues year=2001/2010 {
        append using med`year' if diagnosis=="ami", keep(varlist)
    }

Note that you can -append- to an empty dataset (no need to -use- the first year and -append- the rest). It is the -if- qualifier that is the problem. A plausible but I/O-intensive solution is:

    forvalues year=2001/2010 {
        clear
        use id diagnosis if diagnosis=="ami" using med`year'
        save ami`year'
    }
    clear
    forvalues year=2001/2010 {
        append using ami`year'
    }


In a Statalist posting David Kantor explained that -merge- reserves memory for the largest possible set of kept results, even if only a fraction of the observations match and are actually kept. In our case, where only a small fraction match, this makes a huge difference. The workaround is to split the -using- file into pieces, merge them individually, save each result, and append the results together.
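A sketch of that workaround for a 1:1 merge on id, with hypothetical file names (big.dta as the using file, master.dta as the in-memory data), might look like this; splitting by row ranges is safe here because each id appears at most once in the using file:

```stata
* split the using file into 4 row-range pieces, so each
* -merge- reserves memory for only a quarter of the rows
use big, clear
local N = _N
local step = ceil(`N'/4)
forvalues i=1/4 {
    local first = (`i'-1)*`step' + 1
    local last  = min(`i'*`step', `N')
    use big in `first'/`last', clear
    save piece`i', replace
}
* merge each piece against the master data, then append the results
forvalues i=1/4 {
    use master, clear
    merge 1:1 id using piece`i', keep(match) nogenerate
    save result`i', replace
}
clear
forvalues i=1/4 {
    append using result`i'
}
```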

-merge- statements are quite slow compared to -use-. Our fairly ordinary Linux boxes can read 3.4 million rows per second of 10 floats, but merging that file with a single variable in the workspace runs at only a tenth that speed. If only one variable is kept (via -keepusing(varlist)-), or only a tiny percentage of the using rows are kept (via -keep(match)-), the speed can be partially restored, to about 1.2 million rows/second. It is possible that something about the way data is stored internally makes this impossible to improve, but it is unfortunate.
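Both mitigations can be combined in a single -merge-, for example keeping only one variable from a hypothetical visits.dta and discarding non-matching rows:

```stata
merge 1:1 id using visits, keepusing(cost) keep(match) nogenerate
```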

Suppose there is in core a list of patients with an AMI, and you wish to merge in the doctor visits of those patients from the annual op (outpatient) files. You might hope to do something like this:

    forvalues year=2002/2010 {
        merge 1:m id using op`year'
    }

In addition to the ambiguity about which rows to keep, that can't work because after the first merge there are duplicate ids in core (from multiple doctor visits in the first year). The best workaround I can come up with is:

    forvalues year=2002/2010 {
        clear
        use ami
        merge 1:m id using op`year', keep(match)
        save ami`year', replace
    }
    clear
    forvalues year=2002/2010 {
        append using ami`year'
    }

-append- allows multiple files to be concatenated, but -merge- doesn't allow multiple using files to be joined in one command.


The -save- command is much more restricted than the other I/O commands - no varlist, -if- or -in- support. So dividing a file into subsets requires rereading the file for each subset. For example, instead of:

    forvalues i=1/50 {
        save state`i' if state==`i'
    }

prior to Stata 16 we have:

    forvalues i=1/50 {
        clear
        use file if state==`i'
        save state`i'
    }

In version 16 frames were introduced, which partially obviate the inefficiency:

    forvalues i=1/50 {
        frame put _all if state==`i', into(f)
        frame f: save state`i'
        frame drop f
    }

but this doesn't avoid scanning the full dataset 50 times, so it can be slow.

Other commands

Surprisingly, the commands -infile-, -infix-, -xmlsave-, -export sasxport-, -fdause-, -fdasave- and the new -export delimited- allow -if-, -in- and a variable list, while -insheet-, -outsheet-, -xmluse-, -import sasxport- and -save- don't allow any of those (the -varlist- option in -infile- does something different). The new -import delimited- replacement for -insheet- has -rowrange- (which acts like a more limited -in-) and -colrange- (which acts like a varlist), which can improve throughput if only a fraction of the data are read.
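For example, to read only the first five columns and a limited row range of a (hypothetical) csv file:

```stata
import delimited using med2009.csv, rowrange(2:1000001) colrange(1:5) clear
```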

I should note that the -in- qualifier on the -use- statement can be a bit of a disappointment in Stata 15. That is:

    use med2009 in 1/100

doesn't stop reading at record 100. Instead it reads all 143 million records, but doesn't add any past 100 to the in-core dataset. Apparently this is a consequence of value labels and long strings being stored at the end of the data file. See this example and this result. The -saveold- command is a way to avoid this retrogression if you control the creation of the file.

In version 16 this is fixed, so it is feasible to read a large file in chunks.
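One way to do that (a sketch; the chunk size and the processing step are placeholders) is to get the record count with -describe using-, which reads only the file header, and then loop over -in- ranges:

```stata
* read a large file in chunks of 10 million records (Stata 16+)
describe using med2009
local N = r(N)
local chunk = 10000000
forvalues first = 1(`chunk')`N' {
    local last = min(`first' + `chunk' - 1, `N')
    use med2009 in `first'/`last', clear
    * ...process this chunk here...
}
```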

Daniel Feenberg

Last update 26 December 2021