Workshop and Conference
Shortly after the initial announcement from MITACS of funding for this project, the team initiated its
activities with a workshop, entitled
Workshop on Statistical Methods for Complex Surveys. The workshop was held
at the Centre de recherches mathématiques in Montréal April 30 - May 2, 2003 and was funded by the National
Program on Complex Data Structures (NPCDS).
There were about 50 participants and 12 speakers. The workshop themes were directly related to the science as
described in this proposal: (1) variance estimation for complex without replacement sampling designs; (2) modeling of
correlated duration data from longitudinal surveys; (3) multi-level modeling of survey data; and (4) item response
theory for surveys. On the first day of the workshop researchers from Statistics Canada made presentations of four
complex surveys run by that organization. The focus of these presentations was on data analytic problems arising from
the complexity of these surveys. On the second and third days researchers, both statisticians and subject matter
specialists, made presentations that spoke directly to the workshop theme areas. There were two international speakers,
Chris Skinner of the University of Southampton and Dick Wiggins of City University in London, England. Several team
members also made presentations.
Since 2004, the team has established a tradition of holding invited sessions at annual meetings of the Statistical
Society of Canada (SSC). One of the major features of these invited sessions is that many of the invited speakers are
students who conducted research through the project's internship programs. They include Irene Lu, Norberto
Pantoja and Qunshu Ren at SSC 2005; Ivan Carrillo, Xiaojian Xu and Chunfang Lin at SSC 2006; Taslim Mallick, Michelle
Zhou and Cindy Feng at SSC 2008; and Zhijian Chen, Dagmar Mariaca-Hajducek and Yan Liu at SSC 2009.
Since 2006, the project has run a seminar series, with research presentations held at UBC, SFU, York, University of
Toronto, Waterloo, University of Windsor, University of Montreal, and Statistics Canada. These presentations are given
by team members, including Professors Jiahua Chen, David Haziza, J.N.K. Rao and Changbao Wu, and researchers from project
partner organizations, including Drs Milorad Kovacevic and Georgia Roberts of Statistics Canada.
Research Activities and Outcomes Related to Internship Programs
Central to this project is the pairing of academic researchers and their doctoral students and/or postdoctoral fellows
with researchers at Statistics Canada, Westat Inc and the Toronto Rehabilitation Institute. Internship positions are
advertised nationally for placements at Statistics Canada and internship students are selected based on
the merits of their research proposals and background qualifications. Students spend four to six months at the partner
organizations, where they identify and formulate research problems through discussions among the student, his/her assigned
supervisor(s) at the partner organization, and his/her academic supervisor(s) at the home university. In most cases the
research project continues after the internship and becomes an important part of the student's thesis research work.
By April 2010, this MITACS project had completed twenty student internship programs, fifteen of them at
Statistics Canada. Two additional students are currently holding internship positions
at Statistics Canada during Fall 2010 (September - December). Completed internship research projects to date are outlined
below.
1. Irene Lu: PhD student of Roland Thomas at Carleton University; Intern at Statistics Canada from January to May
2004. Research Project: Embedding IRT in Structural Equation Models.
A natural question arising from preliminary work of Roland Thomas on parameter estimation
techniques for linear models featuring latent variables was the extent to which score prediction would bias regression
parameters obtained when scores for children's math and reading ability based on data from the National Longitudinal
Survey of Children and Youth (NLSCY) database, were used by analysts in secondary research studies. Irene Lu recently
completed her PhD on related topics under Dr. Thomas' supervision and is now an Assistant Professor at York U. Using a
combination of simulation and calculation, she showed that the biases in regression parameters could be very severe in
realistic situations when a two-step approach was used (i.e., first estimate the Item Response Theory (IRT) scores,
then use the IRT scores in regression). She showed that IRT provided little advantage over other commonly used naïve
methods of score construction. This effect, referred to as finite-item bias in her work, is present in any two-step
method and disappears only as the number of items goes to infinity. One of her main contributions was to utilize the
connection between IRT and discrete structural equation modeling (IRT-SEM) to obtain consistent regression estimates
free of finite item bias. During her Statistics Canada residency period on a MITACS/NPCDS Internship in conjunction
with the seed project to this proposal, supervised by Harold Mantel, she replicated her results using data from the
Youth in Transition Survey (YITS), a complex sample survey. Her YITS analysis took full account of the sampling weights
to obtain design-consistent estimators using the IRT-SEM approach, and by comparing these consistent parameter estimates
to those obtained using a weighted two-step regression, she determined the extent of the finite item bias in her YITS
2. Wilson Lu: PhD student of Randy Sitter at Simon Fraser University; Intern at Westat Inc. in 2004.
Research Project: Replication Variance Estimation Methods and Confidentiality Issues with
Disclosure of Public Use Survey Data.
In this project, Wilson Lu worked with three researchers from Westat Inc., Mike Brick, Leyla Mohadjer,
Sylvia Dorhmann and team member Randy Sitter and developed novel methods to handle issues surrounding non-disclosure
of confidential information in public releases of survey data. Specifically, an important issue in surveys is the conflict of
interest between information sharing and disclosure of personal information. Statistical agencies routinely release data for public
use with some information suppressed for confidentiality. If care is not taken, one can (partially) reconstruct the
stratum and/or cluster indicators and thus break confidentiality (i.e., one may be able to identify a specific company or, say,
AIDS patient). The research team was able to demonstrate these dangers of current approaches and propose a new approach - using
scheduling theory algorithms - to reduce the risk of breaches in confidentiality, while providing consistent variance estimates.
This technique has now been implemented in the U.S. National Health and Nutrition Examination Survey, with significant impact on
3. Norberto Pantoja Galicia: PhD student of Mary Thompson at University of Waterloo; Intern at Statistics Canada
from September 2004 to February 2005. Research Project: Bivariate Density Estimation for Interval Censored Data from
Longitudinal surveys allow the observation of durations, but because
successive interviews are separated in time, the endpoints are often bounded within intervals rather than recorded
precisely. A PhD student, Norberto Pantoja Galicia, is working with Mary Thompson on the problem of making inference
about the order of two event times when the times are interval censored at random. Let T1 and T2 be the times of events
of interest, for example corresponding to becoming pregnant and smoking cessation respectively. Thompson and Pantoja
Galicia (2002) propose a formal nonparametric test for order. This test involves the estimation of the survivor
functions of T1 and T2 from the data, as well as the joint distribution of (T1, T2-T1).
Inspired by the ideas of Duchesne and Stafford (2001, Tech. Rep. 0106, U. of T.) and Braun, Duchesne and Stafford (2004,
Can. J. Statist., to appear), Thompson and Pantoja Galicia have developed an R program that deals with bivariate density
estimation for interval censored data. The next step, which is being carried out at Statistics Canada where Norberto Pantoja
Galicia is currently visiting on a MITACS/NPCDS (National Program on Complex Data Structures) Internship related to the
seed project to this proposal, is to make the appropriate adaptation to this program when the data have been collected
with a complex design. Following that, they will test the methodology by using complex survey data as input. In
particular, they will employ the related data from the NPHS. This will involve the incorporation of the corresponding
4. Qinshu Ren: PhD student of J.N.K. Rao at Carleton University; Intern at Statistics Canada from September 2004 to
May 2005. Research Project: Analysis of Longitudinal Survey Data with Binary or Ordinal Responses.
Within the context of marginal modeling of longitudinal survey data, Qinshu Ren, who is a PhD student of Jon
Rao, is examining issues surrounding binary or ordinal responses. Longitudinal surveys typically lead to dependent
observations over time in addition to the customary cross-sectional dependencies induced by the clustering in the sampling
design. He is applying the theory for marginal models developed by Rao (1998) to the NPHS data and extending the
methodology using odds ratios to model associations between pairs of binary or ordinal responses. Previous work on the
NPHS data used only cross-sectional data. Initial results demonstrate the advantages of longitudinal analysis in terms
of efficiency of estimators and estimating change parameters. Survey design features were accounted for by using the
bootstrap variance estimation method for stratified multi-stage sampling. Ren, who is currently visiting Statistics
Canada on a MITACS/NPCDS Internship related to the seed project to this proposal, supervised by Georgia Roberts, has
implemented the bootstrap method for longitudinal data analysis using the bootstrap weights developed at Statistics
5. Ivan Carrillo-Garcia: PhD student of Changbao Wu and Jiahua Chen at University of Waterloo; Intern at Statistics
Canada from September to December 2005. Research Project: Analysis of Longitudinal Surveys with Missing Observations.
All surveys, either cross-sectional or longitudinal, have at least some amount of
nonresponse; the two types of nonresponse are unit nonresponse and item nonresponse.
To be able to draw inferences in the presence of either type of missingness, the
analyst makes assumptions about the response mechanism underlying them. These
assumptions can be either explicit or implicit, but they are always required. The three
commonly assumed response mechanisms are MCAR, MAR, and NMAR. The MCAR, or missing
completely at random, mechanism assumes that the probability of getting a missing
value is completely independent of the measurement process. MAR, or missing at random,
means that the probability of a missing value is conditionally independent of the
unobserved measurements given the values already observed. Under NMAR, or not
missing at random (also called non-ignorable nonresponse), the analyst assumes that the
probability of a missing value depends on its actual value, even after conditioning on
all the observed quantities.
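The practical difference between the three mechanisms can be illustrated with a small simulation. The Python sketch below is only an illustration under stated assumptions and is not part of the internship work: the data model, the logistic missingness rules, the sample size and all function names are hypothetical. It compares the respondent (complete-case) mean with a regression-adjusted mean that uses an always-observed covariate x.

```python
import math
import random

random.seed(2010)  # arbitrary seed so the sketch is reproducible

N = 50_000
x = [random.gauss(0.0, 1.0) for _ in range(N)]   # covariate, always observed
y = [xi + random.gauss(0.0, 1.0) for xi in x]    # response, subject to missingness
full_mean = sum(y) / N                           # benchmark: mean with no missing values


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def respondents(prob_missing):
    """Return the (x, y) pairs that survive a given missingness rule."""
    return [(xi, yi) for xi, yi in zip(x, y)
            if random.random() >= prob_missing(xi, yi)]


def complete_case_mean(pairs):
    return sum(yi for _, yi in pairs) / len(pairs)


def regression_adjusted_mean(pairs):
    """Fit y = a + b*x on respondents, then average the predictions over all x."""
    n = len(pairs)
    mx = sum(xi for xi, _ in pairs) / n
    my = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    b = sxy / sxx
    a = my - b * mx
    return a + b * (sum(x) / N)


# Hypothetical missingness rules, one per mechanism.
mechanisms = {
    "MCAR": lambda xi, yi: 0.3,            # constant probability of being missing
    "MAR": lambda xi, yi: logistic(xi),    # depends only on the observed covariate
    "NMAR": lambda xi, yi: logistic(yi),   # depends on the value that goes missing
}

results = {}
for name, rule in mechanisms.items():
    kept = respondents(rule)
    results[name] = (complete_case_mean(kept), regression_adjusted_mean(kept))
    print(name, [round(v, 3) for v in results[name]])
print("full-data mean", round(full_mean, 3))
```

In this toy setting the complete-case mean is close to the full-data mean only under MCAR; conditioning on x repairs the estimate under MAR but not under NMAR, which is why the assumed mechanism matters.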
The additional variability in the estimates, introduced by the response mechanism,
should not be ignored. There is an extensive literature on methods for handling nonresponse
in cross-sectional surveys, on their properties, and on comparisons among them.
For longitudinal surveys analyzed with the generalized estimating equations (GEE) methodology, the missing data problem,
the properties of the different assumed response mechanisms, and their impacts on
variances are much less well understood. This internship research project studies the modeling of longitudinal survey data from a
joint randomization perspective, with particular interest in examining the properties of estimators obtained from the
GEE methods when missing responses are filled through weighted or unweighted random
6. Xiaojian Xu: PhD student of Doug Wiens at University of Alberta; Intern at Statistics Canada from September to
December 2005. Research Project: Treatments of Link Nonresponse in Indirect Sampling.
The focus of this project was primarily on the development of methodology for estimating a cross-sectional population total
using longitudinal survey data in the context of indirect sampling. Indirect sampling refers to selecting samples from a population
that is not the target population of interest but is related to it. Such a sampling scheme is often used when we do not
have a sampling frame for the target population, but have a sampling frame for another
population (the sampling population) that is related to it.
The generalized weight share method (GWSM) for producing cross-sectional
estimates from longitudinal survey data is provided by Lavallée (1995).
This weighting scheme provides unbiased estimates irrespective of the sampling
scheme used to obtain a sample from the sampling population. In the process of
GWSM implementation, adjustment for a variety of nonresponse problems has to
be made, as with any other weighting scheme. However, with indirect sampling there
is another type of nonresponse called link nonresponse. It is the situation
where it is impossible to determine, or one fails to determine, whether a unit in the sampling
population is related to a unit in the target population. Link nonresponse
causes severe overestimation when GWSM is used without proper adjustment.
A few adjustment methods for correcting the estimation bias caused by link
nonresponse were proposed and tested during this internship. The simulation
results show that these proposed methods perform well in reducing both estimation bias and variance.
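The weight-sharing idea and the effect of link nonresponse can be sketched on a toy example. The Python code below is a minimal illustration, not the method developed in the internship: the populations, links and y-totals are made up, the design is simple random sampling without replacement, and link nonresponse is modelled (as an assumption) by treating an undetermined link as absent when counting the total number of links into a cluster. Each cluster weight is the sum of design-weighted links from sampled frame units divided by the total number of links into the cluster.

```python
from itertools import combinations

# Hypothetical toy populations; every number below is made up for illustration.
frame_units = ["a1", "a2", "a3", "a4"]                  # sampling population
cluster_totals = {"c1": 10.0, "c2": 20.0, "c3": 30.0}   # y-totals of target clusters
# links[(frame unit, cluster)] = number of links between that unit and the cluster
links = {("a1", "c1"): 1, ("a2", "c1"): 1, ("a2", "c2"): 1,
         ("a3", "c2"): 2, ("a4", "c3"): 1}

SAMPLE_SIZE = 2
incl_prob = SAMPLE_SIZE / len(frame_units)   # SRS without replacement
TRUE_TOTAL = sum(cluster_totals.values())


def gwsm_total(sample, undetermined=frozenset()):
    """Weight-share estimate of the target total for one sample of frame units.

    Links listed in `undetermined` are only known when their frame unit is in
    the sample; otherwise they are (wrongly) left out of the link count.
    """
    estimate = 0.0
    for cluster, y_total in cluster_totals.items():
        cluster_links = {j: c for (j, k), c in links.items() if k == cluster}
        # Design-weighted links coming from the sampled frame units.
        numerator = sum(c / incl_prob for j, c in cluster_links.items() if j in sample)
        if numerator == 0:
            continue  # cluster not reached by this sample
        # Total number of links into the cluster, as far as it is known.
        denominator = sum(c for j, c in cluster_links.items()
                          if j in sample or (j, cluster) not in undetermined)
        estimate += numerator / denominator * y_total
    return estimate


def design_expectation(undetermined=frozenset()):
    """Average the estimator over all equally likely samples."""
    samples = list(combinations(frame_units, SAMPLE_SIZE))
    return sum(gwsm_total(set(s), undetermined) for s in samples) / len(samples)


mean_complete_links = design_expectation()
mean_with_link_nonresponse = design_expectation(frozenset({("a3", "c2")}))
print(TRUE_TOTAL, mean_complete_links, mean_with_link_nonresponse)
```

With all links known, the average over all possible samples reproduces the true total; when a link is missed in the denominator, the shared weights are too large and the average exceeds the true total, which is the overestimation described above.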
7. Odesh Singh: MS student of Michael Escobar at University of Toronto; Intern at Toronto Rehabilitation Institute from
November 2004 to April 2005. Research Project: Acquired Brain Injury in an Administrative Health Database.
8. Zheng Zheng: PhD student of Nancy Reid at University of Toronto; Intern at Statistics Canada from January to
May 2006. Research Project: Bootstrap Methods and Their Asymptotic Properties for Complex Surveys.
The main objectives of the proposed research are to
develop a high order asymptotic (HOA) method and suitable bootstrap
methods for confidence interval estimation using data from complex
sampling designs, and to investigate the connections between those
methods. The study will focus on a typical sampling scheme employed to
collect data from a finite survey population where there is interest in
both superpopulation and finite population parameters. If analytical
solutions do not exist or are cumbersome to apply, it may be possible
to make or derive additional simplifications using the estimating
functions bootstrap and to use pseudo-likelihood instead of likelihood
when applying the HOA method. The high order asymptotic method to be used was developed by Fraser and
Reid (1995). The bootstrap methods reviewed in DiCiccio and Efron
(1996), the estimating functions bootstrap of Kalbfleisch and Hu (2002)
and the linearized estimating functions bootstrap of Binder et al.
(2004) will be examined in the context of sample surveys. New methods
will be investigated theoretically and also implemented on some current
Statistics Canada data sets.
9. Devon Chunfang Lin: PhD student of Randy Sitter at Simon Fraser University; Intern at Westat Inc. in 2006.
Research Project: Replication Variance Estimation in Two-Stage Sampling.
In two-stage sampling in complex surveys,
replication-based variance estimation is often applied to the
first-stage sampling units. Theoretical justification is based on
the assumption that the first-stage sampling fraction is
negligible. Motivated by some surveys where this assumption is not
met, this project develops adaptations of the method of balanced repeated
replications and bootstrapping that do not require this
assumption, explores the asymptotic properties of the derived
variance estimators, and conducts simulation studies to evaluate the
performance of the proposed methods with finite sample sizes.
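For orientation, the standard form of balanced repeated replication, which relies on the negligible sampling fraction assumption, can be sketched as follows. This Python example shows only the textbook starting point and not the adaptations developed in the project; the PSU totals are hypothetical, and the design is assumed to have exactly two sampled first-stage units per stratum.

```python
# Hypothetical weighted PSU totals: two sampled PSUs in each of three strata.
psu_totals = [(12.0, 15.0), (30.0, 22.0), (8.0, 11.0)]

# 4 x 4 Hadamard matrix. Columns 1 to 3 are mutually orthogonal, so the four
# rows define a balanced set of half-samples for three strata.
HADAMARD_4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]

# Full-sample estimate of the population total.
full_estimate = sum(first + second for first, second in psu_totals)


def replicate_estimate(row):
    """Estimate the total from one half-sample, doubling the retained PSU."""
    total = 0.0
    for h, (first, second) in enumerate(psu_totals):
        chosen = first if row[h + 1] == 1 else second
        total += 2.0 * chosen
    return total


replicates = [replicate_estimate(row) for row in HADAMARD_4]

# BRR variance estimate: mean squared deviation of replicates from the full estimate.
v_brr = sum((r - full_estimate) ** 2 for r in replicates) / len(replicates)

# Usual with-replacement variance estimate for two PSUs per stratum.
v_textbook = sum((first - second) ** 2 for first, second in psu_totals)

print(full_estimate, v_brr, v_textbook)
```

For a linear estimator such as a total, the balanced set of half-samples reproduces the usual with-replacement variance estimate exactly. Neither version contains a finite population correction, which is why a non-negligible first-stage sampling fraction calls for the kind of adaptation studied in this project.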
10. Cindy Xin Feng: PhD student of Randy Sitter at Simon Fraser University; Intern at Westat Inc. in 2006.
Research Project: Confidence Intervals for Proportions and Quantiles under Two-stage Sampling Designs.
It is well known that the conventional confidence interval for a population proportion does not perform well for large or small
values of the proportion. Several alternative methods have been proposed in the literature for the case where the sample data are independent
and identically distributed. For finite populations the problem is further complicated due to the use of complex sampling designs and
issues related to effective sample sizes and effective degrees of freedom. This project investigates the performance of several
confidence intervals for proportions and quantiles under two-stage sampling designs through simulation studies.
An application to the U.S. National Health and Nutrition Examination Surveys (NHANES) is discussed.
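The weakness of the conventional interval for small proportions is easy to see numerically. The Python sketch below compares the conventional (Wald) interval with the Wilson score interval, one well-known alternative from the i.i.d. literature, after replacing the sample size by an effective sample size. This is an illustration only: the text does not say which intervals the project examined, and the sample size, design effect and estimated proportion are hypothetical.

```python
import math

Z_95 = 1.959964  # 97.5th percentile of the standard normal distribution


def effective_sample_size(n, design_effect):
    """Sample size deflated by the design effect (assumed known here)."""
    return n / design_effect


def wald_interval(p_hat, n_eff, z=Z_95):
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_eff)
    return p_hat - half_width, p_hat + half_width


def wilson_interval(p_hat, n_eff, z=Z_95):
    denom = 1.0 + z ** 2 / n_eff
    centre = (p_hat + z ** 2 / (2.0 * n_eff)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_eff
                               + z ** 2 / (4.0 * n_eff ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical survey: 200 respondents, design effect 2, estimated proportion 2%.
n_eff = effective_sample_size(n=200, design_effect=2.0)
wald = wald_interval(0.02, n_eff)
wilson = wilson_interval(0.02, n_eff)
print("Wald:  ", wald)
print("Wilson:", wilson)
```

In this example the Wald interval extends below zero, while the Wilson interval stays inside (0, 1).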
11. Huanhuan Wu: MS student of Carl Schwarz and Tom Loughin at Simon Fraser University; Intern at BC Ministry of
Agriculture and Land from September to December 2006. Research Project: Antibiotic resistance surveillance program.
12. Michelle Qian Zhou: PhD student of Peter Song at University of Waterloo; Intern at Statistics Canada from May to
August 2007. Research Project: Small Area Estimation Methods for Analysis of Spatial and Longitudinal Survey Data.
Area level models such as Fay-Herriot models (Fay and Herriot, 1979) have been
widely used to obtain reliable model-based estimators in small area estimation.
However, in the model, two strong assumptions are made. One is that the sampling
error variances are customarily assumed to be known, and the other is that the
area-specific random effects are assumed to be independent and identically distributed.
This research project investigates four full hierarchical Bayes (HB) models which relax these two
strong assumptions by constructing Gaussian conditional autoregressive (CAR) models on
the area-specific effects to induce spatial correlation, and/or assuming the sampling
variances unknown. Through analysis of the survey data from Cycle 1.1 of the Canadian
Community Health Survey (CCHS), we compare the HB model-based estimates
with direct design-based estimates of the rate of asthma for the 20 health regions in
British Columbia. Our results show that the model-based estimates perform better than
the direct estimates. In addition, the proposed area-level CAR models have smaller coefficients of variation (CVs)
than the Fay-Herriot model, which imposes independent area-specific random effects.
Moreover, a larger number of neighbours provides more information in CAR models,
leading to greater CV reduction over the Fay-Herriot model.
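As background for the comparison above, the basic Fay-Herriot estimator can be sketched in a few lines. The Python example below is a simplified illustration, not the hierarchical Bayes CAR models studied in the project: it uses an intercept-only model, treats both the sampling variances and the model variance as known, and all numbers are hypothetical. Each small area estimate is a weighted average of the direct estimate and the synthetic (model) estimate, with more shrinkage for areas with larger sampling variance.

```python
def fay_herriot_estimates(direct, sampling_var, sigma2_v):
    """Composite estimates under an intercept-only Fay-Herriot model.

    direct       : direct design-based estimates, one per area
    sampling_var : sampling variances of the direct estimates (assumed known)
    sigma2_v     : variance of the area random effects (assumed known here)
    """
    # Generalized least squares estimate of the common mean.
    gls_weights = [1.0 / (sigma2_v + psi) for psi in sampling_var]
    beta_hat = (sum(w * y for w, y in zip(gls_weights, direct))
                / sum(gls_weights))
    # Shrinkage factors: share of the total variance due to the area effects.
    gammas = [sigma2_v / (sigma2_v + psi) for psi in sampling_var]
    estimates = [g * y + (1.0 - g) * beta_hat for g, y in zip(gammas, direct)]
    return beta_hat, gammas, estimates


# Hypothetical direct estimates of a rate in four areas and their variances.
direct = [0.08, 0.12, 0.10, 0.15]
sampling_var = [0.0001, 0.0009, 0.0004, 0.0025]

beta_hat, gammas, estimates = fay_herriot_estimates(direct, sampling_var,
                                                    sigma2_v=0.0004)
for d, g, e in zip(direct, gammas, estimates):
    print(f"direct={d:.3f}  gamma={g:.3f}  composite={e:.4f}")
```

The two assumptions relaxed in the project appear explicitly here: `sampling_var` is taken as known, and the area effects share a single variance with no spatial structure.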
13. Xingqiu Zhao: PhD student of Nara Balakrishnan at McMaster University; Intern at Statistics Canada from May to
August 2007. Research Project: Semiparametric Regression Analysis of Longitudinal Survey Data with Informative
Most of the existing work in the literature focuses on the analysis of non-survey longitudinal data.
This research project considers the problem of longitudinal surveys with informative dropouts, and proposes a
semiparametric regression approach and discusses parameter estimation under the joint randomization due to
both the model and the sampling selection. The variances of the proposed estimators consist of two components:
the model variance and the design variance. The estimators of the model variance are derived and the design variance
can be estimated by a design-based approach. The method developed is illustrated with data from the National Longitudinal
Survey of Children and Youth.
14. Taslim Mallick: PhD student of Brajendra Sutradhar at Memorial University; Intern at Statistics Canada from September
to December 2007. Research Project: Analysis of Incomplete Longitudinal Survey Data.
In a longitudinal study, it is likely that individuals' responses are missing in some follow-ups. This missingness can be
completely random (MCAR) or random given the individual's previous history (MAR). Taslim's internship project is to analyze SLID
longitudinal survey data with a binary response variable, assuming the missingness is of MAR type. The analysis is carried out
through the proposed Weighted Generalized Quasi-likelihood (WGQL) approach. Results are compared with the existing
Weighted Generalized Estimating Equation (WGEE) approach under the MAR assumption and with GQL under the MCAR assumption. Simulation studies
are performed by controlling the response model to follow an MAR data process with a large proportion of missing responses for the SLID data
to examine the effect of sampling design on estimation methods.
15. Zhijian Chen: PhD student of Changbao Wu and Grace Yi at University of Waterloo; Intern at Statistics Canada in Fall 2008.
Research Project: Logistic Regression Analysis Using Complex Survey Data with Misclassification in an Ordinal Covariate.
Measurement errors are common in complex surveys where variables are often collected through non-standard procedures.
We consider estimation of regression coefficients in logistic regression analysis using survey data, where misclassification
of an ordinal covariate is present and is dependent on other variables. We propose to use the expected score method which
employs a parametric assumption for the measurement error (misclassification) process. The method is applied to a data set
from Canadian Community Health Survey (CCHS), where self-reported body mass index (BMI) is believed to be a risk factor for
several chronic conditions. A limited simulation study is carried out to investigate the performance of the proposed method.
16. Dagmar Mariaca-Hajducek: PhD student of Jerry Lawless at University of Waterloo; Intern at Statistics Canada in Fall 2008
and Winter 2009.
Research Project: Fitting Cox Models to Jobless Spell Durations in SLID.
This project examines fitting Cox PH models to jobless spell durations for individuals from the Survey of Labour and Income
Dynamics (SLID), over a six-year period. Features like within-individual and within-cluster association in spell durations,
dependent loss to follow-up (LTF) and non-ignorable sampling design are considered. Within-individual dependence is taken
into account by including previous jobless history in the form of covariates. Dependent loss to follow-up and non-ignorable
sampling are accounted for by using combined sampling and LTF inverse probability weights in the estimation procedure.
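The weighting idea in the last sentence can be sketched as follows. This Python example is a minimal illustration under stated assumptions, not the project's estimation code: the spells, design weights and retention probabilities are made up, there is a single covariate and no tied durations, and the fit uses a weighted partial log-likelihood maximised by a simple ternary search. Each spell receives a combined weight equal to its design weight divided by its estimated probability of remaining under follow-up.

```python
import math

# Hypothetical jobless spells: (duration in months, spell ended?, covariate,
# design weight, estimated probability of remaining under follow-up).
spells = [
    (2.0, True, 1.0, 150.0, 0.9),
    (3.0, True, 0.0, 200.0, 0.8),
    (4.0, True, 1.0, 120.0, 0.7),
    (5.0, False, 1.0, 180.0, 0.9),
    (6.0, True, 0.0, 160.0, 0.6),
    (7.0, True, 1.0, 140.0, 0.8),
    (8.0, False, 0.0, 210.0, 0.9),
    (9.0, True, 0.0, 130.0, 0.7),
]


def combined_weights(records, scale=1.0):
    """Sampling weight times inverse probability of not being lost to follow-up."""
    return [scale * design / retained for (_, _, _, design, retained) in records]


def weighted_partial_loglik(beta, records, weights):
    """Weighted Cox partial log-likelihood for one covariate (no tied durations)."""
    loglik = 0.0
    for (t_i, ended, x_i, _, _), w_i in zip(records, weights):
        if not ended:
            continue  # censored spells contribute only through the risk sets
        risk = sum(w_j * math.exp(beta * rec[2])
                   for rec, w_j in zip(records, weights) if rec[0] >= t_i)
        loglik += w_i * (beta * x_i - math.log(risk))
    return loglik


def fit_beta(records, weights, lo=-5.0, hi=5.0, tol=1e-9):
    """Maximise the concave partial log-likelihood by ternary search."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (weighted_partial_loglik(m1, records, weights)
                < weighted_partial_loglik(m2, records, weights)):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


beta_hat = fit_beta(spells, combined_weights(spells))
beta_hat_rescaled = fit_beta(spells, combined_weights(spells, scale=10.0))
print(beta_hat, beta_hat_rescaled)
```

Multiplying all weights by a constant leaves the estimate unchanged, so only the relative sizes of the combined weights matter for the point estimate; variance estimation under the survey design is a separate step not shown here.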
17. Yan Liu: PhD student of Bruno Zumbo at University of British Columbia; Intern at Statistics Canada in Fall 2008 and Winter
2009. Research Project: Challenges in Analyzing National Longitudinal Data: Cohort-Sequential Design and the Application of
Sampling Weights When Using Structural Equation Modeling (SEM).
Cohort-sequential design has been introduced into longitudinal studies in order to deal with time constraints and maturation
of population groups of interest, as well as with sample attrition. The analysis of longitudinal survey data using this research
design brings several challenges for researchers. This paper aims to demonstrate and compare
two analytical methods that adopt an SEM approach for analyzing data from a cohort-sequential design. The two methods are illustrated using
the NLSCY data. Because sampling weights have often been neglected in using SEM, which often biases the results, we also
demonstrate how to apply sampling weights in this kind of modeling.
18. Dongmo Jiongo Valery: PhD student of David Haziza at University of Montreal; Intern at Statistics Canada from November 2009
to February 2010.
Research Project: Robust Inference in the Presence of Influential Units in Surveys.
The project addresses two research problems: (i) Inference in the presence of outliers for imputed data. The conditional bias of a unit
under both the so-called nonresponse approach and the imputation approach is first derived, which leads to a robust version of the usual
imputed estimator, and then imputed values are calibrated on the proposed robust imputed estimator. (ii) Small area estimation. This
is based on the work of Sinha and Rao (2009) on small area methods using linear mixed models with random small area effects and block
diagonal covariance structures. Using a robust version of the BLUP and the conditional bias of a unit, the results of Beaumont, Haziza
and Ruiz-Gazen (2009) are extended to linear mixed models.
19. Haocheng Li: PhD student of Grace Yi at University of Waterloo; Intern at Statistics Canada in Fall 2009.
Research Project: Statistical Analysis for Longitudinal Health Survey Data with Missing Observations.
The research concentrates on handling missing data in health surveys, and is carried out in two stages. In the first stage, Haocheng
considers the case of missingness in both the response and the covariates. He uses models based on a pseudo-likelihood approach and establishes
the consistency and efficiency of the resulting estimators. Generalized linear mixed models are employed to describe the outcome process. Estimation and
inference are carried out to illustrate the associations between health-related response variables and covariates. In the second stage,
Haocheng explores more robust forms such as semiparametric models. Simulation studies are also conducted.
20. Chen Xu: PhD student of Jiahua Chen at University of British Columbia; Intern at Statistics Canada in Fall 2009.
Research Project: Variable Selection with Large Scale Survey Data.
21. Zeinab Mashreghi: PhD student of David Haziza at University of Montreal; Current intern at Statistics Canada
(September - December, 2010).
Research Project: Bootstrap Variance Estimation in the Presence of Imputed Data.
22. Wei Lin: PhD student of Nancy Reid at University of Toronto; Current intern at Statistics Canada
(September - December, 2010).
Research Project: Embedding Experiments Within Surveys.