Mathematics of Information Technology and Complex Systems

Project Highlights - Statistical Methods for Complex Survey Data

Workshop and Conference

Shortly after the initial announcement from MITACS of funding for this project, the team initiated its activities with a workshop entitled Workshop on Statistical Methods for Complex Surveys. The workshop was held at the Centre de recherches mathématiques in Montréal from April 30 to May 2, 2003, and was funded by the National Program on Complex Data Structures (NPCDS). There were about 50 participants and 12 speakers. The workshop themes were directly related to the science as described in this proposal: (1) variance estimation for complex without-replacement sampling designs; (2) modeling of correlated duration data from longitudinal surveys; (3) multi-level modeling of survey data; and (4) item response theory for surveys. On the first day of the workshop, researchers from Statistics Canada made presentations on four complex surveys run by that organization. The focus of these presentations was on data analytic problems arising from the complexity of these surveys. On the second and third days, researchers, both statisticians and subject matter specialists, made presentations that spoke directly to the workshop theme areas. There were two international speakers, Chris Skinner of the University of Southampton and Dick Wiggins of City University in London, England. Several team members also made presentations.

Since 2004, the team has established a tradition of holding invited sessions at the annual meetings of the Statistical Society of Canada (SSC). One of the major features of these invited sessions is that many of the invited speakers are students who conducted research through the project's internship programs. They include Irene Lu, Norberto Pantoja and Qunshu Ren at SSC 2005; Ivan Carrillo, Xiaojian Xu and Chunfang Lin at SSC 2006; Taslim Mallick, Michelle Zhou and Cindy Feng at SSC 2008; and Zhijian Chen, Dagmar Mariaca-Hajducek and Yan Liu at SSC 2009.

Since 2006, the project has run a seminar series, with research presentations held at UBC, SFU, York, University of Toronto, Waterloo, University of Windsor, University of Montreal, and Statistics Canada. These presentations are given by team members, including Professors Jiahua Chen, David Haziza, J.N.K. Rao and Changbao Wu, and researchers from project partner organizations, including Drs Milorad Kovacevic and Georgia Roberts of Statistics Canada.

Research Activities and Outcomes Related to Internship Programs

Central to this project is the pairing of academic researchers and their doctoral students and/or postdoctoral fellows with researchers at Statistics Canada, Westat Inc. and the Toronto Rehabilitation Institute. Internship positions are advertised nationally for placements at Statistics Canada, and internship students are selected based on the merits of their research proposals and background qualifications. Students spend four to six months at the partner organization, where they identify and formulate research problems through discussions with their assigned supervisor(s) at the partner organization and their academic supervisor(s) at the home university. In most cases the research project continues after the internship and becomes an important part of the student's thesis research work.

By April 2010, this MITACS project had completed twenty student internships, fifteen of them at Statistics Canada. Two additional students are currently holding internship positions at Statistics Canada during Fall 2010 (September - December). Completed internship research projects to date are outlined below:

1. Irene Lu: PhD student of Roland Thomas at Carleton University; Intern at Statistics Canada from January to May 2004. Research Project: Embedding IRT in Structural Equation Models.

A natural question arising from preliminary work of Roland Thomas on parameter estimation techniques for linear models featuring latent variables was the extent to which score prediction would bias regression parameters when scores for children's math and reading ability, based on data from the National Longitudinal Survey of Children and Youth (NLSCY), were used by analysts in secondary research studies. Irene Lu recently completed her PhD on related topics under Dr. Thomas' supervision and is now an Assistant Professor at York University. Using a combination of simulation and calculation, she showed that the biases in regression parameters could be very severe in realistic situations when a two-step approach was used (i.e., first estimate the Item Response Theory (IRT) scores, then use the IRT scores in regression). She showed that IRT provided little advantage over other commonly used naïve methods of score construction. This effect, referred to as finite-item bias in her work, is present in any two-step method and disappears only as the number of items goes to infinity. One of her main contributions was to utilize the connection between IRT and discrete structural equation modeling (IRT-SEM) to obtain consistent regression estimates free of finite-item bias. During her Statistics Canada residency period on a MITACS/NPCDS Internship in conjunction with the seed project to this proposal, supervised by Harold Mantel, she replicated her results using data from the Youth in Transition Survey (YITS), a complex sample survey. Her YITS analysis took full account of the sampling weights to obtain design-consistent estimators using the IRT-SEM approach, and by comparing these consistent parameter estimates to those obtained using a weighted two-step regression, she determined the extent of the finite-item bias in her YITS examples.

2. Wilson Lu: PhD student of Randy Sitter at Simon Fraser University; Intern at Westat Inc. in 2004. Research Project: Replication Variance Estimation Methods and Confidentiality Issues with Disclosure of Public Use Survey Data.

In this project, Wilson Lu worked with three researchers from Westat Inc., Mike Brick, Leyla Mohadjer and Sylvia Dorhmann, and with team member Randy Sitter, to develop novel methods for handling issues surrounding non-disclosure of confidential information in public releases of survey data. Specifically, an important issue in surveys is the conflict of interest between information sharing and disclosure of personal information. Statistical agencies routinely release data for public use with some information suppressed for confidentiality. If care is not taken, one can (partially) reconstruct the stratum and/or cluster indicators and thus break confidentiality (i.e., one may be able to identify a specific company or, say, AIDS patient). The research team was able to demonstrate these dangers of current approaches and propose a new approach, using scheduling theory algorithms, to reduce the risk of breaches in confidentiality while providing consistent variance estimates. This technique has now been implemented in the U.S. National Health and Nutrition Examination Survey, with significant impact on public policy.

3. Norberto Pantoja Galicia: PhD student of Mary Thompson at University of Waterloo; Intern at Statistics Canada from September 2004 to February 2005. Research Project: Bivariate Density Estimation for Interval Censored Data from Complex Surveys.

Longitudinal surveys allow the observation of durations, but because successive interviews are separated in time, the endpoints are often bounded within intervals rather than recorded precisely. A PhD student, Norberto Pantoja Galicia, is working with Mary Thompson on the problem of making inference about the order of two event times when the times are interval censored at random. Let T1 and T2 be the times of events of interest, for example corresponding to becoming pregnant and smoking cessation respectively. Thompson and Pantoja Galicia (2002) propose a formal nonparametric test for order. This test involves the estimation of the survivor functions of T1 and T2 from the data, as well as the joint distribution of (T1, T2-T1). Inspired by the ideas of Duchesne and Stafford (2001, Tech. Rep. 0106, U. of T.) and Braun, Duchesne and Stafford (2004, Can. J. Statist., to appear), Thompson and Pantoja Galicia have developed an R program that deals with bivariate density estimation for interval censored data. The next step, being carried out at Statistics Canada where Norberto Pantoja Galicia is currently visiting on a MITACS/NPCDS (National Program on Complex Data Structures) Internship related to the seed project to this proposal, is to adapt this program to data that have been collected with a complex design. Following that, they will test the methodology using complex survey data as input. In particular, they will employ the related data from the NPHS, which will involve incorporating the corresponding survey weights.

4. Qinshu Ren: PhD student of J.N.K. Rao at Carleton University; Intern at Statistics Canada from September 2004 to May 2005. Research Project: Analysis of Longitudinal Survey Data with Binary or Ordinal Responses.

Within the context of marginal modeling of longitudinal survey data, Qinshu Ren, who is a PhD student of Jon Rao, is examining the issues surrounding binary or ordinal responses. Longitudinal surveys typically lead to dependent observations over time in addition to customary cross-sectional dependencies induced by the clustering in the sampling design. He is applying the theory for marginal models developed by Rao (1998) to the NPHS data and extending the methodology using odds ratios to model associations between pairs of binary or ordinal responses. Previous work on the NPHS data used only cross-sectional data. Initial results demonstrate the advantages of longitudinal analysis in terms of efficiency of estimators and estimating change parameters. Survey design features were accounted for by using the bootstrap variance estimation method for stratified multi-stage sampling. Ren, who is currently visiting Statistics Canada on a MITACS/NPCDS Internship related to the seed project to this proposal, supervised by Georgia Roberts, has implemented the bootstrap method for longitudinal data analysis using the bootstrap weights developed at Statistics Canada.
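
The text does not say how the bootstrap weights are constructed. As an illustration only, the following Python sketch shows one common construction for stratified multi-stage designs, the rescaling bootstrap, in which n_h - 1 primary sampling units (PSUs) are resampled with replacement within each stratum. The function names and data layout are hypothetical, and the sketch assumes negligible first-stage sampling fractions and at least two sampled PSUs per stratum.

```python
import random
from collections import Counter

def rescaled_bootstrap_weights(strata, weights, rng):
    """Return one replicate of bootstrap weights (hypothetical helper).

    strata  : dict mapping stratum label -> list of sampled PSU ids
    weights : dict mapping PSU id -> design weight
    Assumes every stratum contains at least two sampled PSUs.
    """
    replicate = {}
    for psus in strata.values():
        n_h = len(psus)
        # Draw n_h - 1 PSUs with replacement within the stratum.
        draws = Counter(rng.choice(psus) for _ in range(n_h - 1))
        for psu in psus:
            # Rescaling factor; its expectation over draws is 1.
            factor = n_h / (n_h - 1) * draws[psu]
            replicate[psu] = weights[psu] * factor
    return replicate

def bootstrap_variance(strata, weights, y, n_reps=500, seed=1):
    """Bootstrap variance of the weighted total of PSU-level values y."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_reps):
        w = rescaled_bootstrap_weights(strata, weights, rng)
        totals.append(sum(w[p] * y[p] for p in w))
    mean = sum(totals) / n_reps
    return sum((t - mean) ** 2 for t in totals) / n_reps
```

Any estimator, including one from a longitudinal marginal model, can be recomputed with each set of replicate weights in place of the design weights; the spread of the replicate estimates then gives the variance estimate.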

5. Ivan Carrillo-Garcia: PhD student of Changbao Wu and Jiahua Chen at University of Waterloo; Intern at Statistics Canada from September to December 2005. Research Project: Analysis of Longitudinal Surveys with Missing Observations.

All surveys, whether cross-sectional or longitudinal, have at least some nonresponse, of two types: unit nonresponse and item nonresponse. To draw inferences in the presence of either type of missingness, the analyst must make assumptions about the underlying response mechanism. These assumptions can be either explicit or implicit, but they are always required. The three commonly assumed response mechanisms are MCAR, MAR, and NMAR. The MCAR, or missing completely at random, mechanism assumes that the probability of a missing value is completely independent of the measurement process. MAR, or missing at random, means that the probability of a missing value is conditionally independent of the unobserved measurements given the values already observed. Under NMAR, or not missing at random (also called non-ignorable nonresponse), the analyst assumes that the probability of a missing value depends on its actual value, even after conditioning on all the observed quantities. The additional variability in the estimates introduced by the response mechanism should not be ignored. There is an extensive literature on nonresponse issues for cross-sectional surveys, including the properties of the assumed response mechanisms and comparisons among them. For longitudinal surveys analyzed with the GEE methodology, the missing data problem, the properties of the different assumed response mechanisms, and their impacts on variances are much less well understood. This internship research project studies the modeling of longitudinal survey data from a joint randomization perspective, with particular interest in examining the properties of estimators obtained from the GEE methods when missing responses are filled in through weighted or unweighted random hot-deck imputation.
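
The practical difference between the three mechanisms can be seen in a small simulation. The Python sketch below is an illustration only and is not part of the internship project: the model (y = x + noise with true mean 0), the logistic response probabilities and all names are assumptions chosen for the example. It compares the unadjusted respondent mean of y under each mechanism and, for MAR, a mean weighted by the inverse of the response probability, which here depends only on the always-observed x.

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def simulate(n=20000, seed=0):
    """Compare respondent means of y under three response mechanisms.

    x is always observed; y = x + noise has true mean 0 (assumed model).
    """
    rng = random.Random(seed)
    naive = {"MCAR": [0.0, 0], "MAR": [0.0, 0], "NMAR": [0.0, 0]}
    ipw_num = ipw_den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)
        prob = {
            "MCAR": 0.6,           # constant response probability
            "MAR": logistic(x),    # depends on the observed x only
            "NMAR": logistic(y),   # depends on the possibly missing y
        }
        for name, p in prob.items():
            if rng.random() < p:   # the unit responds under this mechanism
                naive[name][0] += y
                naive[name][1] += 1
                if name == "MAR":
                    # Inverse-probability weighting using p(x).
                    ipw_num += y / p
                    ipw_den += 1.0 / p
    result = {name: s / c for name, (s, c) in naive.items()}
    result["MAR, weighted by 1/p(x)"] = ipw_num / ipw_den
    return result

if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

In this setup the unadjusted mean is close to the true value only under MCAR; under MAR the bias can be removed using the observed x, while under NMAR no adjustment based on observed quantities alone is available without further modeling assumptions.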

6. Xiaojian Xu: PhD student of Doug Wiens at University of Alberta; Intern at Statistics Canada from September to December 2005. Research Project: Treatments of Link Nonresponse in Indirect Sampling.

The focus of this project was primarily on the development of methodology for estimating cross-sectional population totals using longitudinal survey data in the context of indirect sampling. Indirect sampling refers to selecting samples from a population that is not the target population of interest but is related to it. Such a sampling scheme is often used when no sampling frame is available for the target population but a frame exists for another population (the sampling population) that is related to it. The generalized weight share method (GWSM) for producing cross-sectional estimates from longitudinal survey data was proposed by Lavallée (1995). This weighting scheme provides unbiased estimates irrespective of the sampling scheme used to obtain the sample from the sampling population. When the GWSM is implemented, adjustments for a variety of nonresponse problems have to be made, as with any other weighting scheme. However, with indirect sampling there is another type of nonresponse, called link nonresponse. It arises when it is impossible to determine whether a unit in the sampling population is related to a unit in the target population. Link nonresponse causes severe overestimation when the GWSM is used without proper adjustment. A few adjustment methods for correcting the estimation bias caused by link nonresponse were proposed and tested during this internship. The simulation results show that the proposed methods perform well in reducing both estimation bias and variance.
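
The weight sharing step and the effect of link nonresponse can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method developed in the internship: links are recorded at the cluster level, the full set of links into each cluster is assumed known, and the function names and toy data are hypothetical.

```python
from collections import defaultdict

def gwsm_cluster_weights(design_weights, links):
    """Share design weights from the sampling population to target clusters.

    design_weights : dict unit -> 1/pi for SAMPLED units of the sampling
                     population (unsampled units are simply absent)
    links          : iterable of (unit, cluster) pairs, one per link between
                     a unit of the sampling population and a target cluster
    """
    initial = defaultdict(float)   # sum of design weights over sampled links
    n_links = defaultdict(int)     # total number of links into the cluster
    for unit, cluster in links:
        n_links[cluster] += 1
        initial[cluster] += design_weights.get(unit, 0.0)
    return {c: initial[c] / n_links[c] for c in n_links}

def gwsm_total(design_weights, links, cluster_totals):
    """Estimate the target population total from cluster totals of y."""
    w = gwsm_cluster_weights(design_weights, links)
    return sum(w[c] * cluster_totals[c] for c in w)

if __name__ == "__main__":
    # Toy example: three frame units, one sampled with probability 1/3.
    links = [("a1", "c1"), ("a2", "c1"), ("a2", "c2"), ("a3", "c2")]
    totals = {"c1": 12.0, "c2": 4.0}   # true total is 16
    estimates = [gwsm_total({u: 3.0}, links, totals) for u in ("a1", "a2", "a3")]
    print(estimates, sum(estimates) / 3)
    # Link nonresponse: the link a2-c1 is not observed.
    observed = [l for l in links if l != ("a2", "c1")]
    print(gwsm_total({"a1": 3.0}, observed, totals))
```

In the toy example the three possible samples give estimates of 18, 24 and 6, which average to the true total of 16. When the link between a2 and c1 goes unobserved, the link count for c1 is too small, so the shared weight is too large and the estimate from sample a1 rises from 18 to 36, which illustrates the overestimation described above.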

7. Odesh Singh: MS student of Michael Escobar at University of Toronto; Intern at Toronto Rehabilitation Institute from November 2004 to April 2005. Research Project: Acquired Brain Injury in an Administrative Health Database.

8. Zheng Zheng: PhD student of Nancy Reid at University of Toronto; Intern at Statistics Canada from January to May 2006. Research Project: Bootstrap Methods and Their Asymptotic Properties for Complex Surveys.

The main objectives of the proposed research are to develop a high order asymptotic (HOA) method and suitable bootstrap methods for confidence interval estimation using data from complex sampling designs, and to investigate the connections between those methods. The study will focus on a typical sampling scheme employed to collect data from a finite survey population where there is interest in both superpopulation and finite population parameters. If analytical solutions do not exist or are cumbersome to apply, it may be possible to make or derive additional simplifications using the estimating functions bootstrap and to use pseudo-likelihood instead of likelihood when applying the HOA method. The high order asymptotic method to be used was developed by Fraser and Reid (1995). The bootstrap methods reviewed in DiCiccio and Efron (1996), the estimating functions bootstrap of Kalbfleisch and Hu (2002) and the linearized estimating functions bootstrap of Binder et al. (2004) will be examined in the context of sample surveys. New methods will be investigated theoretically and also implemented on some current Statistics Canada data sets.

9. Devon Chunfang Lin: PhD student of Randy Sitter at Simon Fraser University; Intern at Westat Inc. in 2006. Research Project: Replication Variance Estimation in Two-Stage Sampling.

In two-stage sampling in complex surveys, replication-based variance estimation is often applied to the first-stage sampling units. Theoretical justification is based on the assumption that the first-stage sampling fraction is negligible. Motivated by some surveys where this assumption is not met, this project develops adaptations of the method of balanced repeated replications and bootstrapping that do not require this assumption, explores the asymptotic properties of the derived variance estimators, and conducts simulation studies to evaluate the performance of the proposed methods with finite sample sizes.

10. Cindy Xin Feng: PhD student of Randy Sitter at Simon Fraser University; Intern at Westat Inc. in 2006. Research Project: Confidence Intervals for Proportions and Quantiles under Two-stage Sampling Designs.

It is well known that the conventional confidence interval for a population proportion does not perform well when the proportion is large or small. Several alternative methods have been proposed in the literature for the case where the sample data are independent and identically distributed. For finite populations the problem is further complicated by the use of complex sampling designs and by issues related to effective sample sizes and effective degrees of freedom. This project investigates the performance of several confidence intervals for proportions and quantiles under two-stage sampling designs through simulation studies. An application to the U.S. National Health and Nutrition Examination Surveys (NHANES) is discussed.
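
To make the problem concrete, the Python sketch below compares the conventional (Wald) interval with one well-known alternative for independent data, the Wilson score interval. The text does not say which intervals the project studied, so this choice is an assumption made for illustration, as is the simple device of replacing n by an effective sample size n / deff to account for the design; the function names are hypothetical.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Conventional interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval; it always stays within [0, 1]."""
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

def effective_sample_size(n, design_effect):
    """Sample size deflated by an assumed design effect (deff)."""
    return n / design_effect

if __name__ == "__main__":
    p_hat, n, deff = 0.01, 200, 2.0   # illustrative values only
    n_eff = effective_sample_size(n, deff)
    print(wald_interval(p_hat, n_eff))
    print(wilson_interval(p_hat, n_eff))
```

With a small estimated proportion and a modest effective sample size, the Wald interval extends below zero, while the Wilson interval remains inside the unit interval.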

11. Huanhuan Wu: MS student of Carl Schwarz and Tom Loughin at Simon Fraser University; Intern at BC Ministry of Agriculture and Land from September to December 2006. Research Project: Antibiotic resistance surveillance program.

12. Michelle Qian Zhou: PhD student of Peter Song at University of Waterloo; Intern at Statistics Canada from May to August 2007. Research Project: Small Area Estimation Methods for Analysis of Spatial and Longitudinal Survey Data.

Area level models such as Fay-Herriot models (Fay and Herriot, 1979) have been widely used to obtain reliable model-based estimators in small area estimation. However, the model makes two strong assumptions. One is that the sampling error variances are customarily assumed to be known, and the other is that the area-specific random effects are assumed to be independent and identically distributed. This research project investigates four full hierarchical Bayes (HB) models which relax these two strong assumptions by constructing Gaussian conditional autoregressive (CAR) models on the area-specific effects to induce spatial correlation, and/or by treating the sampling variances as unknown. Through analysis of the survey data from Cycle 1.1 of the Canadian Community Health Survey (CCHS), we compare the HB model-based estimates and direct design-based estimates of the asthma rate for the 20 health regions in the province of British Columbia. Our results show that the model-based estimates perform better than the direct estimates. In addition, the proposed area-level CAR models have smaller CVs than the Fay-Herriot model, which imposes independent area-specific random effects. Moreover, a larger number of neighbours offers more efficient information in CAR models, leading to greater CV reduction over the Fay-Herriot model.
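
For orientation, the basic Fay-Herriot estimator that serves as the baseline here can be sketched as follows. The Python function below is an illustration under the two strong assumptions named above, with the sampling variance and the model variance both treated as known and the regression-synthetic value supplied by the caller; it does not implement the HB or CAR models studied in the project, and its name and arguments are hypothetical.

```python
def fay_herriot_composite(direct, synthetic, sampling_var, model_var):
    """Composite small area estimate under the basic Fay-Herriot model.

    direct       : direct design-based estimate for the area
    synthetic    : regression-synthetic value x'beta for the area
    sampling_var : sampling error variance of the direct estimate (assumed known)
    model_var    : variance of the area-specific random effect (assumed known)
    Returns the estimate and its mean squared error when both variances
    and the regression coefficients are known.
    """
    # Shrinkage factor: weight placed on the direct estimate.
    gamma = model_var / (model_var + sampling_var)
    estimate = gamma * direct + (1.0 - gamma) * synthetic
    mse = gamma * sampling_var
    return estimate, mse

if __name__ == "__main__":
    # Illustrative numbers only: equal variances give equal weights.
    print(fay_herriot_composite(10.0, 20.0, sampling_var=4.0, model_var=4.0))
```

Areas with noisy direct estimates are pulled more strongly toward the synthetic value, and the resulting mean squared error never exceeds the sampling variance of the direct estimate. In practice the model variance and regression coefficients are estimated, which adds further terms to the mean squared error.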

13. Xingqiu Zhao: PhD student of Nara Balakrishnan at McMaster University; Intern at Statistics Canada from May to August 2007. Research Project: Semiparametric Regression Analysis of Longitudinal Survey Data with Informative Dropouts.

Most of the existing work in the literature focuses on the analysis of non-survey longitudinal data. This research project considers the problem of longitudinal surveys with informative dropouts, proposes a semiparametric regression approach, and discusses parameter estimation under the joint randomization due to both the model and the sampling selection. The variances of the proposed estimators consist of two components: the model variance and the design variance. The estimators of the model variance are derived and the design variance can be estimated by a design-based approach. The method developed is illustrated with data from the National Longitudinal Survey of Children and Youth.

14. Taslim Mallick: PhD student of Brajendra Sutradhar at Memorial University; Intern at Statistics Canada from September to December 2007. Research Project: Analysis of Incomplete Longitudinal Survey Data.

In a longitudinal study, it is likely that individuals' responses are missing in some follow-ups. This missingness can be completely random (MCAR) or random given the individual's previous history (MAR). Taslim's internship project is to analyze SLID longitudinal survey data with a binary response variable, assuming the missingness is of MAR type. The analysis is carried out through the proposed Weighted Generalized Quasi-likelihood (WGQL) approach. Results are compared with the existing Weighted Generalized Estimating Equation (WGEE) approach under the MAR assumption and with GQL under the MCAR assumption. Simulation studies are performed by controlling the response model to follow an MAR data process with a large proportion of missing responses for the SLID data, in order to examine the effect of the sampling design on the estimation methods.

15. Zhijian Chen: PhD student of Changbao Wu and Grace Yi at University of Waterloo; Intern at Statistics Canada in Fall 2008. Research Project: Logistic Regression Analysis Using Complex Survey Data with Misclassification in an Ordinal Covariate.

Measurement errors are common in complex surveys where variables are often collected through non-standard procedures. We consider estimation of regression coefficients in logistic regression analysis using survey data, where misclassification of an ordinal covariate is present and is dependent on other variables. We propose to use the expected score method which employs a parametric assumption for the measurement error (misclassification) process. The method is applied to a data set from Canadian Community Health Survey (CCHS), where self-reported body mass index (BMI) is believed to be a risk factor for several chronic conditions. A limited simulation study is carried out to investigate the performance of the proposed method.

16. Dagmar Mariaca-Hajducek: PhD student of Jerry Lawless at University of Waterloo; Intern at Statistics Canada in Fall 2008 and Winter 2009. Research Project: Fitting Cox Models to Jobless Spell Durations in SLID.

This project examines fitting Cox PH models to jobless spell durations for individuals from the Survey of Labour and Income Dynamics (SLID), over a six-year period. Features like within-individual and within-cluster association in spell durations, dependent loss to follow-up (LTF) and non-ignorable sampling design are considered. Within-individual dependence is taken into account by including previous jobless history in the form of covariates. Dependent loss to follow-up and non-ignorable sampling are accounted for by using combined sampling and LTF inverse probability weights in the estimation procedure.

17. Yan Liu: PhD student of Bruno Zumbo at University of British Columbia; Intern at Statistics Canada in Fall 2008 and Winter 2009. Research Project: Challenges in Analyzing National Longitudinal Data: Cohort-Sequential Design and the Application of Sampling Weights When Using Structural Equation Modeling (SEM).

Cohort-sequential designs have been introduced into longitudinal studies in order to deal with time constraints and maturation of population groups of interest, as well as with sample attrition. The analysis of longitudinal survey data collected under this research design brings several challenges for researchers. This paper aims to demonstrate and compare two analytical methods that adopt the SEM approach for analyzing data from cohort-sequential designs. The two methods are illustrated with the NLSCY data. Because sampling weights have often been neglected in SEM analyses, which often biases the results, we also demonstrate how to apply sampling weights in this kind of modeling.

18. Dongmo Jiongo Valery: PhD student of David Haziza at University of Montreal; Intern at Statistics Canada from November 2009 to February 2010. Research Project: Robust Inference in the Presence of Influential Units in Surveys.

The project addresses two research problems: (i) Inference in the presence of outliers for imputed data. The conditional bias of a unit under both the so-called nonresponse approach and the imputation approach is first derived, which leads to a robust version of the usual imputed estimator, and then imputed values are calibrated on the proposed robust imputed estimator. (ii) Small area estimation. This is based on the work of Sinha and Rao (2009) on small area methods using linear mixed models with random small area effects and block diagonal covariance structures. Using a robust version of the BLUP and the conditional bias of a unit, the results of Beaumont, Haziza and Ruiz-Gazen (2009) are extended to linear mixed models.

19. Haocheng Li: PhD student of Grace Yi at University of Waterloo; Intern at Statistics Canada in Fall 2009. Research Project: Statistical Analysis for Longitudinal Health Survey Data with Missing Observations.

The research concentrates on handling missing data in health surveys, and is carried out in two stages. In the first stage, Haocheng considers the case of missingness in both response and covariates. He uses models based on a pseudo-likelihood approach and works out the consistency and efficiency of the resulting estimators. Generalized linear mixed models are employed to describe the outcome process. Estimation and inference are carried out to illustrate the associations between health-related response variables and covariates. In the second stage, Haocheng explores more robust forms such as semiparametric models. Simulation studies are also conducted.

20. Chen Xu: PhD student of Jiahua Chen at University of British Columbia; Intern at Statistics Canada in Fall 2009. Research Project: Variable Selection with Large Scale Survey Data.

21. Zeinab Mashreghi: PhD student of David Haziza at University of Montreal; Current intern at Statistics Canada (September - December, 2010). Research Project: Bootstrap Variance Estimation in the Presence of Imputed Data.

22. Wei Lin: PhD student of Nancy Reid at University of Toronto; Current intern at Statistics Canada (September - December, 2010). Research Project: Embedding Experiments Within Surveys.