Self-Censoring Stata Output

CMS requires that statistical tables produced from Medicare billing data not include any results based on cells with fewer than 10 respondents. This project provides some Stata procedures that directly enforce this requirement and report only a missing value code for those cells. An equivalent restriction is imposed on dummy variables in regression output.

These programs are intended to make examination of output for compliance with CMS standards faster and more reliable, and to prevent inadvertent violations of the standard. They are not intended to prevent deliberate disclosures, and it is certainly possible to transmit a disclosure through this filter with determined intention. Given that the user already has access to the original micro-data, there is no incentive for them to disclose via the more complicated official release channel.

These programs depend on the global macro variable $mincellsize, which defaults to 10. Any cell based on fewer than $mincellsize records is replaced with the .s missing value code.
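The rule can be illustrated with a minimal sketch. This is not the code inside any of the programs described here; it uses -collapse- on the nlsw88 example data, and the variable names mean_wage and n are made up for the illustration.

```stata
* Hypothetical sketch of the suppression rule; not the code inside -stable-.
sysuse nlsw88, clear

* Fall back to the documented default of 10 when the global is unset
if "$mincellsize" == "" global mincellsize 10

* One row per grade: the cell mean and the number of records behind it
collapse (mean) mean_wage = wage (count) n = wage, by(grade)

* Replace statistics from small cells with the extended missing value .s
replace mean_wage = .s if n < $mincellsize

list grade mean_wage, noobs
```

Because .s is an extended missing value, it sorts and tests as missing, so later commands treat suppressed cells like any other missing statistic.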

Each table they create includes the line ".s indicates a statistic based on fewer than $mincellsize records". If no other tables are included in $release.log, then that file should be suitable for release. Exceptions would be if similar tables with cells differing by a single record were produced, or if there were published aggregates for the sample. These are things that should be obvious to any reviewer.

Some of these programs append censored output to a log file, whose name (less filetype) is given by the global macro $release, and whose default value is "release". This is in addition to any log file maintained by the user's program itself. Appending allows output from multiple programs to accumulate in a single release.log file, but it means starting a new log file is the responsibility of the user.
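The appending behaviour can be sketched with Stata's named logs. This is only an assumption about how such a log could be handled, not a copy of what the programs do; the log name "censored" is invented for the example.

```stata
* Hypothetical sketch of appending to the release log; not taken from the programs.
if "$release" == "" global release release      // documented default base name
if "$mincellsize" == "" global mincellsize 10   // documented default threshold

* Open a second, named log alongside any log the user already has open.
* With the append option, the file is created if it does not exist yet.
log using "$release.log", append text name(censored)
display ".s indicates a statistic based on fewer than $mincellsize records"
log close censored

* Starting a fresh release file is left to the user, for example:
* erase "$release.log"
```

Using a named log leaves the user's own log untouched, which matches the description above of the release log being kept in addition to it.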

Programs producing non-ASCII output have options and defaults of their own for directing output. These procedures do not edit printed program output, because parsing Stata log files is difficult and unreliable. Instead they either redo the computations with additional restrictions (stable and ssummtab) or edit saved results (sreg).

The Programs available now

-stable-

-stable- is a modification of the Stata -table- command that does not support the "row" and "column" options. That omission is fortuitous, since some suppressed cells could otherwise be reconstructed by subtracting the remaining cells from the total. -stable- places output in the user's log file and in $release.log. For example, the commands:

    sysuse nlsw88
    stable grade, contents(mean wage iqr wage max tenure)

yield the output:

    -------------------------------------------------
      current |
        grade |
    completed | mean(wage)   iqr(wage)  max(tenure)
    ----------+--------------------------------------
            0 |         .s          .s           .s
            4 |   3.011271    .6441219           .s
            5 |         .s          .s           .s
            6 |    3.82026    1.513687           .s
            7 |   3.797682    1.731077           .s
            8 |      5.437    2.986808           .s
            9 |   5.655415    2.198068           .s
           10 |   4.692721    2.553535           .s
           11 |   5.688235    2.801929           .s
           12 |   6.638048    3.663443           .s
           13 |   8.315217     4.49275           .s
           14 |   9.130599    4.806763           .s
           15 |   9.885779     4.42029           .s
           16 |   9.806044    5.809176           .s
           17 |   10.43081    6.070848           .s
           18 |   11.60784    4.609798           .s
    -------------------------------------------------
    .s indicates fewer than 3 records

Note that the statistics max, min, median, first and last are always suppressed, while the interquartile range is allowed. I think that conforms to the spirit of the regulations, though perhaps not the letter.
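The risk from row and column totals can be seen in a small worked sketch. The four-record dataset below is invented for the illustration: if the overall total and the one visible cell were both published, the suppressed single-record cell could be recovered by subtraction.

```stata
* Toy data (hypothetical): group 2 is a single-record cell that would be suppressed.
clear
input byte group double wage
1 10
1 12
1 14
2 20
end

* What a "total" row would reveal
quietly summarize wage
local total_n   = r(N)
local total_sum = r(sum)

* What the visible group 1 cell reveals
quietly summarize wage if group == 1

* Subtracting the visible cell from the total recovers the hidden cell
local hidden_n    = `total_n' - r(N)
local hidden_mean = (`total_sum' - r(sum)) / `hidden_n'
display "hidden cell: n = `hidden_n', mean = `hidden_mean'"
* prints: hidden cell: n = 1, mean = 20
```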

-ssummtab-

-ssummtab- is a modification of the SSC -summtab- command for creating publication quality summary statistics in Word or Excel formats, as would typically accompany a paper with regression results. If any cells are missing, all cells in that column will be missing, suggesting the table is not yet suitable for publication. -ssummtab- includes an option to name the output file but does not put any results in either log, as it has no option for ASCII output. For example, the commands:

    keep if race==2
    ssummtab, by(union) mean word replace contvars(age married grade south wage hours tenure)

yield the output (after conversion from xlsx to html with st):

                                          nonunion         union
                                          (N = 16)         (N = 8)
    age in current year       Mean (SD)   39.06 (3.64)     .s (.s)
    married                   Mean (SD)   0.69 (0.48)      .s (.s)
    current grade completed   Mean (SD)   13.44 (4.47)     .s (.s)
    lives in south            Mean (SD)   0.19 (0.40)      .s (.s)
    hourly wage               Mean (SD)   9.46 (5.90)      .s (.s)
    usual hours worked        Mean (SD)   39.88 (10.50)    .s (.s)
    job tenure (years)        Mean (SD)   4.28 (3.22)      .s (.s)

-sreg-

-sreg- does not do regressions, but creates reports based on information stored by Stata from most (all?) regression commands. Running -sreg- after a regression command will append the results to $release.log with non-releasable factor variable parameter estimates suppressed. The suppressed estimates include _cons and any factor variable indicator with fewer than 10 zero values or fewer than 10 non-zero values in the estimation sample. At this time -sreg- does not yet examine non-factor variables, so some care is required before output is released. An enhancement to detect non-factor variable dummies is feasible. For example, the commands:

    sysuse nlsw88
    regress wage tenure i.grade
    sreg

yield the output:

    ------------------------------------------------------------------------------
            wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          tenure |   .1070933   .0346732     3.09   0.002     .0389881    .1751984
                 |
           grade |
              4  |         .s          .        .       .            .           .
              6  |    .569052   4.819114     0.12   0.906     -8.89666    10.03476
              7  |    1.07241   4.751872     0.23   0.822    -8.261224    10.00604
              8  |   .8121087   4.638867     0.18   0.861    -8.299561    9.923779
              9  |   1.669255   4.581578     0.36   0.716    -7.329887     10.6684
             10  |     1.1163   4.577486     0.24   0.807    -7.874805    10.10741
             11  |   2.001984   4.554536     0.44   0.660    -6.944043    10.04801
             12  |   2.413749   4.520139     0.53   0.594    -6.464716    11.29221
             13  |   4.101939   4.575359     0.90   0.370    -4.884988    13.08887
             14  |   5.155073   4.562176     1.13   0.259     -3.80596    14.11611
             15  |   6.131421   4.622188     1.33   0.185    -2.947488    15.41033
             16  |   8.011438   4.559408     1.76   0.079    -.9441576    16.96703
             17  |   6.613028   4.607801     1.44   0.152    -2.437623    15.66368
             18  |   8.428397   4.599476     1.83   0.067    -.6059005    17.46269
                 |
           _cons |         .s          .        .       .            .
    ------------------------------------------------------------------------------
    .s indicates fewer than 3 records
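The check on factor variable indicators can be approximated by hand. The following is a minimal sketch of that idea, assuming the rule is "too few ones or too few zeros in the estimation sample"; it is not the code used by -sreg-, and the local macro names are invented for the illustration.

```stata
* Hypothetical sketch of the cell-size check on factor levels; not -sreg- itself.
sysuse nlsw88, clear
if "$mincellsize" == "" global mincellsize 10
quietly regress wage tenure i.grade

* Levels of grade that actually occur in the estimation sample
quietly levelsof grade if e(sample), local(levels)

foreach g of local levels {
    * Records with the indicator equal to one, and the remainder with zero
    quietly count if grade == `g' & e(sample)
    local ones  = r(N)
    local zeros = e(N) - `ones'
    if min(`ones', `zeros') < $mincellsize {
        display "grade `g': too few records in the estimation sample, suppress"
    }
}
```

The loop also visits the base level of grade, which has no coefficient of its own; a small base cell is still worth knowing about, since it is what the constant describes.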

Users of -outreg- or -estout- can use -sreg- to process the VCV matrix before those procedures see it. For example:

    global release results1
    reg y x dummy
    sreg
    outreg2 using results2
    estimates store m
    estout m using results3, style(html)

Ritchie points out that merely removing the coefficient for the constant is actually sufficient to prevent disclosure and would be considerably cheaper, as detecting dummy variables is potentially expensive.

To Do

-table- and -summtab- are .ado files with source code available (from StataCorp and SSC respectively), making them feasible to modify. Only a few lines were changed in each program, and it is likely that many other Stata programs can be treated similarly. Please contact me with suggestions. -sreg- works because Stata has standard return values for estimation commands, and those matrices can be modified for printing. Many programs are Stata builtins that cannot be modified and do not return any results in machine-readable format. Those I can't do much with.

Examples

Sources and Patches

References

Felix Ritchie and Mark Elliot, Principles- Versus Rules-Based Output Statistical Disclosure Control In Remote Access Environments https://iassistdata.org/sites/default/files/iqvol_39_2_ritchie.pdf

Felix Ritchie, Output-based disclosure control for regressions http://www2.uwe.ac.uk/faculties/BBS/BUS/Research/economics2012/1209.pdf

Felix Ritchie, Analyzing the disclosure risk of regression coefficients, Transactions on Data Privacy 12 (2019) 145-173

Bleninger P., Drechsler J., and Ronning G. (2011) Remote data access and the risk of disclosure from linear regression, Stat. and Op. Res. Trans. Special Issue: Privacy in statistical databases, pp 7-24 http://www.idescat.cat/sort/sortspecial2011/DataPrivacy.1.bleninger-etal.pdf


last modified 10 May 2020 by drf