Mathematics of Information Technology and Complex Systems


Research - Statistical Methods for Complex Survey Data

An Overview

Survey research is a multi-disciplinary activity drawing on a wide range of expertise, including subject matter specialists, questionnaire designers, statisticians, survey managers, interviewers and computer specialists. Analysis of the resulting datasets furthers research in the health and social sciences, which in turn informs policy. This proposal focuses on the various complexities of a common data type, the sample survey, with the potential to affect the entire survey process, research in substantive areas, and subsequent policy.

Here is a brief description of the typical survey process. Once the objectives of the survey are made clear, the subject matter specialist provides general questions that address those objectives. A questionnaire designer then translates these general questions into a format that can be easily understood by a potential respondent. The questionnaire becomes the measuring instrument used to obtain the data from the survey. Once the design of the questionnaire is complete, the sample of potential respondents is chosen by some random mechanism. In view of the competing concerns of the cost of running the survey and the statistical efficiency of the data collected, the sample may be obtained in an economically efficient way, but this can result in a complicated data structure. After the sample has been selected, the questionnaire is delivered by mail, telephone, personal interview or via the Internet. This is carried out under the direction of a survey manager, and this aspect of the process includes the training and supervision of the interviewers. For a wide variety of reasons, the data file that is returned once data collection is complete will almost certainly contain missing values. Statisticians are involved in the development of the sampling design and in the analysis of the resulting data; the research of this MITACS team focuses on the issues that arise in data analysis. It is important to note, however, that the analysis of the data is closely tied to the way in which the data have been collected.

Statistics Canada surveys, similar surveys managed by Westat Inc, and other social surveys have increasingly complex structures. Most methods of data analysis have been developed for cross-sectional settings in which the data are collected at only a single time point. Although cross-sectional surveys may themselves have complex structures, they are simpler than some new surveys currently being run. Taking cross-sectional data as the standard fare of the past, two new and important complexities are increasingly prevalent in current surveys: the introduction of time, and the introduction of space. Longitudinal data involve a set of repeated observations on an individual, or group of individuals, followed through time. Such data now come in many diverse forms, from multi-wave panel surveys to pooled time series and event histories, and they have the advantage of lending greater causal interpretation to variables that are observed to be associated over time. If time adds a horizontal component to the cross-sectional approach, giving context to the present, then space adds a vertical layer, introducing embeddedness in social contexts as an additional concern. Time and space methods occupy the developing frontier in social science and health research.

Many newer surveys conducted by Statistics Canada collect longitudinal data. One prominent example is the National Longitudinal Survey of Children and Youth. The data from this survey, along with similar surveys, are housed at Statistics Canada Research Data Centres (RDCs) across Canada. Statistics Canada has identified a pressing need for new methodologies in view of these ongoing longitudinal data collection efforts. The connection to the RDCs is an important one. They are located on university campuses and are important sources of data for subject matter researchers. Further methodological issues arise when subject matter researchers try to link data from their own databases to data available through the RDCs. Some of the team members have been closely involved in currently operating RDCs or in proposed new ones. With these kinds of connections there is the opportunity for cross-fertilization of ideas as well as technology transfer with subject matter researchers who also use these RDCs. Mary Thompson was one of the key people in bringing an RDC to the University of Waterloo. Jamie Stafford is a member of the committee for the University of Toronto's RDC. David Bellhouse is currently on a committee of social scientists and health care researchers that is in the final stages of obtaining approval for the building of an RDC at the University of Western Ontario. Currently there are nine RDCs operating in Canada, with more being planned.

Research Activities

The purpose of this project is to further the research and development of methodological tools for complex data structures arising from surveys by bringing together academics who have both methodological and subject matter research interests in complex data structures, and researchers from Statistics Canada, Westat Inc and the Toronto Rehabilitation Institute who either have a research interest in these data structures or work with them on the front line on a daily basis. There are three overlapping areas that we will pursue related to the analysis of a common data type, surveys with complex structures. It is the various complexities of this data type that have led to the following focused projects. Any particular dataset could involve all, or some subset, of these complexities, requiring that they be addressed simultaneously in an analysis. The areas are: (A) modelling a process arising in a complex data structure; (B) variance, correlation structure and their estimation; and (C) the handling of missing data. Several sub-projects are briefly described below as they relate to each of these three general areas. All sub-projects being explored have direct application to complex surveys currently run by Statistics Canada.

A. New models

Data collected on an individual may be obtained at a variety of levels. At the basic level is information on each individual respondent. At another level a respondent may belong to a group, and information can be collected on the group that is relevant to the respondent. As an example, Statistics Canada's National Longitudinal Survey of Children and Youth collects data on children, and often a second level of information on each child's family, information that is relevant to all children in the family. Likewise, for a child attending school there might be information on the child's teacher, and further information on the school. In modelling the dynamics of children's development one can use the data obtained at the level of the child, or use the data obtained at several levels of aggregation (multi-level modelling). Two projects related to the modelling of processes using survey data are described here. The first involves a single level of modelling over time, while the second involves multi-level modelling over space and time.
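To fix ideas, a simple two-level random-intercept model for a child-level outcome is sketched below. The notation is purely illustrative and is not taken from the project documents: y_{ij} denotes the outcome for child i in family j, x_{ij} child-level covariates and z_j family-level covariates.

    y_{ij} = \beta_0 + \beta' x_{ij} + \gamma' z_j + u_j + e_{ij},
    u_j \sim N(0, \sigma_u^2), \qquad e_{ij} \sim N(0, \sigma_e^2).

The family random effect u_j induces a within-family intraclass correlation \rho = \sigma_u^2 / (\sigma_u^2 + \sigma_e^2); ignoring it and fitting a single-level model typically understates the uncertainty of estimated family-level effects.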

i. Modelling of correlated durations (spells) and life history transitions using longitudinal survey data (Bellhouse, Lawless, Sutradhar)

Many types of behaviour over time are increasingly regarded as movements at random intervals from one state to another. Examples include movements of individuals between different states of an illness, and movements of individuals between various labour market states. There are several open research issues pertinent to the analysis of these kinds of data: 1) how to model the processes of transition and duration in the presence of drop-out or attrition, when the individuals are clustered due to a complex sample design; 2) how to accommodate interval-censored duration data, where the censoring arises through the superposition of the process of data collection on the process of spell duration; and 3) how to develop mixed models that accommodate between-subject variability, and how to combine model-based and sampling-based analysis.
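For concreteness, transition and duration processes of this kind are often formalized through transition intensities. The Cox-type specification below, with a subject-level random effect (frailty), is only an illustrative sketch and not a model adopted by the team:

    \lambda_{jk}(t \mid x_i(t), v_i) = v_i \, \lambda_{0jk}(t) \, \exp\{\beta_{jk}' x_i(t)\},

where \lambda_{jk} is the intensity of moving from state j to state k, \lambda_{0jk}(t) is a baseline intensity, x_i(t) are possibly time-varying covariates, and the frailty v_i captures between-subject variability. Interval censoring from periodic interviews means that only the state occupied at each wave is observed, so likelihood contributions involve transition probabilities over inter-wave intervals rather than exact transition times; clustering from the sample design adds a further layer of dependence beyond v_i.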

Within Statistics Canada, Georgia Roberts and Milorad Kovacevic have already initiated some work in this area. Prospects for collaborative research are excellent; for example Sutradhar and Kovacevic have already published jointly on the topic of longitudinal data analysis (Biometrika, 2000).

ii. Multi-level modelling (Escobar, Lou, Reid)

Recent concerns with the hierarchical model involve the notion of multiple nesting units in social research: the cross-nested random effects model. This model attempts to incorporate the more complex reality presented by multiple social or health contexts, where an individual is embedded in multiple contexts at once, each potentially having its own influence, but the contexts are not hierarchically related; instead, they occur at the same level. One example arises in considering influences on child development and life course options, where both school and neighborhood may have a determining role. The difficulty is that schools typically draw students from several neighborhoods, and children in the same neighborhood may attend numerous schools: this is cross-nesting. Separating these roles is important, but the methods that can address the issue are still in development. Another example of cross-nesting stems from cross-appointed physicians treating subjects at two or more treatment sites, leading to some degree of correlation among providers across centers. Methods for dealing with such correlations within a hierarchical data structure have yet to be developed, so that it is not yet possible to examine properly which patient, physician and center factors, in addition to the intervention, underlie any changes that might occur in the outcomes of interest.
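A minimal cross-classified (cross-nested) random effects model for the school-and-neighborhood example, written here only as an illustration, is

    y_{i(jk)} = \beta' x_{i(jk)} + u_j + v_k + e_{i(jk)},
    u_j \sim N(0, \sigma_u^2), \qquad v_k \sim N(0, \sigma_v^2), \qquad e_{i(jk)} \sim N(0, \sigma_e^2),

where child i attends school j and lives in neighborhood k. Because schools and neighborhoods are crossed rather than nested, the random effects u_j and v_k enter additively at the same level, and estimation can no longer exploit the block-diagonal structure available in a purely hierarchical design.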

If the final level in a hierarchical model is time, further complexities arise when the hierarchy is not preserved but changes over time due to migration, policy changes and other factors. Current methods ignore part of the structure, for example ignoring the hierarchy and focusing only on time. Approaches are needed that account for both the longitudinal structure and a spatial structure that itself changes over time.

Efforts are being made to develop collaborative research in this area. As part of the team's activities, a session on the topic of multi-level modelling at the Statistical Society of Canada meetings in 2004 was organized by Roland Thomas. Speakers were J. N. K. Rao (Carleton University), Emmanuel Behnin (Statistics Canada) and Danny Pfeffermann (University of Southampton).

The group is collaborating with the Statistical and Applied Mathematical Sciences Institute (SAMSI) on a theme year in Latent Variable Models in the Social Sciences (LVMSS). See www.samsi.info for a link to planned activities. SAMSI held a kick-off workshop for the LVMSS theme year on September 11-14, 2004. The first day featured tutorials on Structural Equation Modelling and Multilevel Modelling.

B. Variance estimation

i. Estimation and analysis of dependencies with complex sampling designs (Rao, Stafford, Thompson)

Variances and covariances between different durations or spells (described, for example, in A.i) on different individuals have a complex structure. They can be further complicated in the analysis of ordered multiple spells. For example, an individual may contribute spells to several strata, and we may allow the effects of covariates to differ across these strata. Statistical inference about these effects needs to allow for the overlapping of the strata. Model-specific research on statistical properties is needed in the presence of complex covariance structures of the above type. Moreover, variance estimation methods based on linearization, the jackknife, the one-step jackknife, the bootstrap and the estimating-equation bootstrap require development.
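As a point of reference, the standard delete-one-cluster jackknife for a stratified multistage design has the form

    v_J(\hat{\theta}) = \sum_{h=1}^{H} \frac{n_h - 1}{n_h} \sum_{i=1}^{n_h} \left( \hat{\theta}_{(hi)} - \hat{\theta} \right)^2,

where n_h is the number of sampled clusters in stratum h and \hat{\theta}_{(hi)} is the estimate recomputed with cluster i of stratum h removed and the remaining clusters in that stratum reweighted by n_h/(n_h - 1). The research questions above concern how estimators of this general type behave, and how they must be modified, when \hat{\theta} is defined through event history or estimating-equation models with the complex covariance structures described here.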

In addition to event history modelling, longitudinal surveys are used for other purposes such as gross flows estimation, the elimination of effects of latent variables in linear regression models using individual changes between consecutive time points, the modelling of marginal means of responses as functions of co-variables, and conditional modelling of the response at a given time point as a function of past responses and present and past co-variables.

ii. Algorithms for the creation of replication variance estimators (Chen, Sitter, Wu)

Both Statistics Canada and Westat Inc produce public access sample files that have been stripped of identifiers in order to protect respondent confidentiality. With the identifiers missing, it is impossible to calculate valid variance estimates without further information. One solution is to use a replication variance estimator, such as the bootstrap, and provide the user only with the replication weights. The use of replication methods for variance estimation in complex surveys is highly computer intensive. However, creating a set of replication weights can be viewed as a related design-of-experiments problem, and there are now, in the field of computer experiments, a number of sophisticated algorithms for constructing large designs quickly and automatically. The adaptation and development of design-of-experiments algorithms toward automatic methods for the quick creation of replication weights for variance estimation in complex surveys will be investigated. In addition, there are difficulties in developing replication methods that adequately mask stratum and cluster identifiers.
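From the data user's point of view, the mechanics are straightforward once replication weights are supplied: each replicate column of weights is used to recompute the estimate, and the spread of the replicate estimates around the full-sample estimate gives the variance estimate. The Python sketch below is a generic illustration for a weighted mean; the function name, its arguments and the default 1/R scaling are assumptions made for this example, since each released file documents its own scaling convention.

    import numpy as np

    def replication_variance(y, full_weights, rep_weights, scale=None):
        """Replication variance estimate for a weighted mean.

        y            : (n,) array of responses
        full_weights : (n,) full-sample survey weights
        rep_weights  : (n, R) matrix, one column of weights per replicate
        scale        : multiplier for the sum of squared deviations;
                       defaults to 1/R, a common survey-bootstrap choice
        """
        y = np.asarray(y, dtype=float)
        n_reps = rep_weights.shape[1]
        theta_full = np.average(y, weights=full_weights)   # full-sample estimate
        theta_reps = np.array([np.average(y, weights=rep_weights[:, r])
                               for r in range(n_reps)])    # one estimate per replicate
        if scale is None:
            scale = 1.0 / n_reps
        return scale * np.sum((theta_reps - theta_full) ** 2)

Because the replicate weights already encode the stratification and clustering, the user never needs the design identifiers themselves, which is what makes this approach attractive for confidential public-use files.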

C. Missing data

Not everyone who is asked participates in a sample survey, and of those who do respond, some may not provide answers to all the questions in the survey. The first situation is called unit nonresponse and the second item nonresponse. Our research proposal is directed at item nonresponse. Since survey data files almost invariably contain missing data due to item nonresponse, this is a topic of interest to all users of survey data as well as to those who produce the data, Statistics Canada and Westat included. Usually item nonresponse occurs because of factors such as fatigue in filling out a long survey or a lack of understanding of a question. In other cases a kind of item nonresponse is planned in advance of the survey: the questionnaire is designed so that no single respondent is asked all the questions, in order to reduce response burden, but over the entire set of respondents all the questions appear. Of the proposals below, (i) deals with problems related to the usual type of item nonresponse and (ii) is related to a type of planned nonresponse.

i. Swiss cheese missing data (Rao, Sitter)

The standard method for handling item nonresponse is to impute the missing data. As the pattern of item nonresponse becomes more complex, the imputation problem becomes more challenging. These complex item nonresponse patterns are often described as "Swiss cheese" patterns because of the irregular scatter of holes they leave in the data matrix. Statistics Canada and Westat would like to be able to build information into the data sets that are made available to researchers which would allow users to seamlessly compute variances that account for the imputation.
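As a very simple illustration of the kind of imputation step involved, the Python sketch below performs random hot-deck imputation of one item within imputation classes. This is only one of many donor-based methods and is not necessarily the approach the team would adopt; the function and column names are invented for the example. The research problem described above concerns what information must accompany such imputed values so that users can later estimate variances that reflect the imputation.

    import numpy as np
    import pandas as pd

    def hot_deck_impute(df, item, class_var, seed=None):
        """Random hot-deck imputation of one item within imputation classes.

        df        : DataFrame that may contain missing values in `item`
        item      : name of the column to impute
        class_var : column defining imputation classes (e.g. age group by region)
        """
        rng = np.random.default_rng(seed)
        out = df.copy()
        for _, idx in out.groupby(class_var).groups.items():
            values = out.loc[idx, item]
            donors = values.dropna().to_numpy()
            missing = values.index[values.isna()]
            if len(donors) > 0 and len(missing) > 0:
                # each missing value receives the value of a randomly chosen donor
                out.loc[missing, item] = rng.choice(donors, size=len(missing), replace=True)
        return out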

ii. Item response theory (Thomas)

It is of interest to adapt item response theory and other psychometric methods for scaling and scoring to survey data. This is of direct concern to Statistics Canada in some of its educational surveys. Much of the test data in these surveys may be missing by design and must be stochastically imputed several times. Each stochastic imputation yields a set of plausible values, which may be used for estimation. A problem is to compute variances for this procedure that incorporate both the stochastic modelling and the survey design.
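For reference, when M sets of plausible values yield estimates \hat{\theta}_1, \ldots, \hat{\theta}_M with design-based variance estimates U_1, \ldots, U_M, the usual multiple-imputation combining rules give

    \bar{\theta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m, \qquad
    T = \bar{U} + \left(1 + \frac{1}{M}\right) B,
    \quad \text{with } \bar{U} = \frac{1}{M} \sum_{m} U_m, \;
    B = \frac{1}{M-1} \sum_{m} (\hat{\theta}_m - \bar{\theta})^2.

The open question raised above is how this kind of combination should be carried out when the within-imputation variances U_m must themselves reflect both the complex survey design and the IRT scaling model used to generate the plausible values.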