A Two-Stage Approach to Missing Data: Theory and Application to Auxiliary Variables

Victoria Savalei, UBC

Abstract: Covariance structure analysis is concerned with testing hypotheses about the structure of the population covariance matrix. Applications include simultaneous equation models, factor models, and full structural equation models. In the presence of missing data, a popular ad hoc approach to conducting such an analysis is to first obtain the saturated maximum likelihood (ML) estimate of the covariance matrix (sometimes called the "EM covariance matrix"), and then to estimate the structured parameters, treating this matrix as if it had been obtained from complete data. This two-stage (TS) approach is appealing because the first stage is easily done, and the second stage reduces the problem to a familiar complete-data problem. An additional advantage of the TS approach is that auxiliary variables, which may be important in predicting missingness, are easily incorporated in stage 1 and can then be ignored entirely in stage 2, reducing the dimensionality of the problem. The main disadvantage is that the standard errors and test statistics obtained in stage 2 will not be correct. In this talk, I will describe how to obtain correct standard errors and test statistics for the parameters obtained in stage 2 of this approach, for normally distributed data that are missing completely at random (MCAR) or missing at random (MAR). I will also compare this approach to a direct maximum likelihood approach. While the TS approach is marginally less efficient, it performs extremely well, and its test statistic outperforms the test statistic from the direct ML approach. The TS method is recommended for use with missing data.
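The two stages described above can be illustrated with a minimal sketch, offered under several assumptions rather than as the speaker's implementation. It assumes NumPy, missing values coded as NaN, and normally distributed data. The function names (em_saturated, f_ml, fit_compound_symmetry) are hypothetical. Stage 1 runs a textbook EM algorithm for the saturated mean vector and covariance matrix. Stage 2 plugs that matrix into the usual normal-theory ML fit function. A compound-symmetry structure stands in for the factor and structural equation models mentioned in the abstract, only because its minimizer has a closed form. The corrected standard errors and test statistics that are the subject of the talk are not computed here.

```python
import numpy as np

def em_saturated(X, n_iter=500, tol=1e-8):
    """Stage 1: saturated ML mean and covariance from data with NaNs (EM)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)               # start from available-case moments
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for i in range(n):
            m, o = miss[i], ~miss[i]
            x = X[i].copy()
            c = np.zeros((p, p))             # conditional covariance of missing part
            if m.any():
                if o.any():
                    # E-step: regress missing entries on observed entries
                    B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                    x[m] = mu[m] + B @ (X[i, o] - mu[o])
                    c[np.ix_(m, m)] = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
                else:
                    x[m] = mu[m]
                    c = sigma.copy()
            sum_x += x
            sum_xx += np.outer(x, x) + c
        # M-step: update moments from expected sufficient statistics
        mu_new = sum_x / n
        sigma_new = sum_xx / n - np.outer(mu_new, mu_new)
        change = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma

def f_ml(S, Sigma):
    """Normal-theory ML discrepancy between S and a model-implied Sigma."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

def fit_compound_symmetry(S):
    """Stage 2 (toy structure): common variance and common covariance.

    For this structure the minimizer of f_ml is the average diagonal and
    the average off-diagonal element of S.
    """
    p = S.shape[0]
    var = np.trace(S) / p
    cov = (S.sum() - np.trace(S)) / (p * (p - 1))
    return var, cov, cov * np.ones((p, p)) + (var - cov) * np.eye(p)

# Illustration on simulated data with about 20% of values deleted at random (MCAR)
rng = np.random.default_rng(0)
true_sigma = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
X = rng.multivariate_normal(np.zeros(4), true_sigma, size=200)
X[rng.random(X.shape) < 0.2] = np.nan

mu_hat, S_em = em_saturated(X)                        # stage 1
var_hat, cov_hat, Sigma_hat = fit_compound_symmetry(S_em)  # stage 2
print("stage 2 estimates:", var_hat, cov_hat)
print("fit function value:", f_ml(S_em, Sigma_hat))
```

Auxiliary variables would enter this sketch as extra columns of X in stage 1. Stage 2 would then use only the rows and columns of the stage 1 matrix that correspond to the model variables. Standard errors and test statistics taken naively from stage 2 would still need the correction described in the talk.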