Program

Reception, University Club
Thursday, October 27, 2011, 5:30 – 7:00pm
 
Conference Day One, DC 1302
Friday, October 28, 2011
8:00 – 8:30 Coffee and registration materials (DC 1301)
8:30 – 8:45 Welcome
8:45 – 10:25 Survey Sampling
Chair: Matthias Schonlau
Speakers: Sharon Lohr and Chris Skinner
10:25 – 10:40 Coffee break (DC 1301)
10:40 – 12:20 Statistics in Social Science
Chair: Jerry Lawless
Speakers: Geoffrey Fong and Mark Handcock
12:20 – 1:40 Lunch (SAS Lounge, M3 3133)
1:40 – 3:20 Biostatistics
Chair: Richard Cook
Speakers: Xihong Lin and Jane-Ling Wang
3:20 – 3:35 Group photo (front door steps of M3)

3:35 – 3:50 Coffee break (DC 1301)
3:50 – 5:30 Causal Inference
Chair: Cecilia Cotton
Speakers: Erica Moodie and Dylan Small
 
Banquet, Festival Room, South Campus Hall
Friday, October 28, 2011, 6:30 – 9:30pm
 
Conference Day Two, DC 1302
Saturday, October 29, 2011
8:15 – 8:45 Coffee (DC 1301)

8:45 – 10:25 Statistical Inference
Chair: Grace Yi
Speakers: Bruce Lindsay and Nancy Reid

10:25 – 10:40 Coffee break (DC 1301)
10:40 – 12:20 Statistical Learning
Chair: Ali Ghodsi
Speakers: Hugh Chipman and Robert Tibshirani
12:20 – 1:40 Lunch (SAS Lounge, M3 3133)

Abstracts

Blending Estimates from Surveys with Possible Bias

Sharon Lohr
Arizona State University

The National Crime Victimization Survey (NCVS) has been conducted since 1972 to measure characteristics and changes in victimization in the United States. Sample size reductions in recent years, however, have decreased the precision of the NCVS for national estimates of victimization. We describe the design of a new Companion Survey (CS) to the NCVS that will be piloted in 2012 to provide supplemental information about victimization. Different sampling modes and nonresponse mechanisms make it likely that the bias patterns of the NCVS and CS will differ. We discuss methods for combining estimates from the two surveys that account for different bias scenarios, and explore properties of the methods through a simulation study based on NCVS data.

This is joint work with J. Michael Brick of Westat.
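
As a purely illustrative sketch of the blending idea (not the specific methods of the talk), the snippet below combines two estimates of the same quantity, treating one survey as the benchmark and shrinking the weight on the other when its estimate appears biased; the numbers and the weighting rule are assumptions chosen only for illustration.

```python
import numpy as np

# Illustrative sketch only: a simple composite ("blended") estimator that
# combines an NCVS-type estimate with a companion-survey estimate whose
# bias is unknown. The weighting rule below is an assumption for
# illustration, not the method presented in the talk.

def blend(est_a, var_a, est_b, var_b):
    """Combine two estimates of the same victimization rate.

    est_a, var_a: estimate and variance from survey A (treated as unbiased).
    est_b, var_b: estimate and variance from survey B (possibly biased).
    The gap est_b - est_a gives a crude bias estimate, and the weight on B
    is chosen to roughly minimize an estimated mean squared error.
    """
    bias_b_sq = max((est_b - est_a) ** 2 - var_a - var_b, 0.0)  # crude bias^2 estimate
    w_b = var_a / (var_a + var_b + bias_b_sq)                   # shrink toward A if B looks biased
    blended = (1 - w_b) * est_a + w_b * est_b
    return blended, w_b

if __name__ == "__main__":
    est, w = blend(est_a=0.031, var_a=4e-6, est_b=0.027, var_b=1e-6)
    print(f"blended estimate: {est:.4f} (weight on survey B: {w:.2f})")
```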


Weighting in the Analysis of Survey Data: A Cross-national Application

Chris Skinner
London School of Economics and Political Science

Survey weighting may be employed to correct for sample selection bias in the regression analysis of survey data. A potential disadvantage of weighting, however, is the inflation of standard errors. Some methods of weight adjustment have been proposed recently which reduce this variance inflation while still correcting for selection bias. This paper will discuss such adjustments. An application to the analysis of voter turnout data from the European Social Survey will be presented, and within- and between-country effects will be contrasted.
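
For concreteness, here is a minimal sketch of the basic design-weighted regression that underlies the discussion; the weight adjustments considered in the talk are not shown, and the simulated data and weights are assumptions for illustration only.

```python
import numpy as np

# Sketch of design-weighted (pseudo-maximum-likelihood) linear regression.
# Variable survey weights inflate the variance of the resulting coefficients,
# which is the motivation for the weight adjustments discussed in the talk.

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = 0.5 + 0.3 * x + rng.normal(size=n)
w = rng.uniform(0.5, 3.0, size=n)            # survey design weights (illustrative)

X = np.column_stack([np.ones(n), x])
# Weighted least squares: beta = (X'WX)^{-1} X'Wy
XtWX = X.T @ (w[:, None] * X)
XtWy = X.T @ (w * y)
beta = np.linalg.solve(XtWX, XtWy)
print("design-weighted coefficients:", beta)
```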


The Concept of the ITC Project and the Importance of Mediational Models

Geoffrey Fong
University of Waterloo

Observational studies and other non-randomized studies have been criticized for their inability to discern causality. But in many important domains of science and life, observational studies are all that are possible. It is therefore incumbent on those who work in such domains to build into the design of observational studies features and structures that enhance the potential for making more confident judgments about possible causal effects.

In the past decade, tobacco use has emerged as one such domain in global health. In 2003, all 192 member countries of the World Health Organization adopted the Framework Convention on Tobacco Control (FCTC), the world’s first health treaty. The FCTC obligates the parties—the countries that have ratified the treaty, now numbering over 170—to implement tobacco control policies such as graphic warning labels, smoke-free laws, higher taxes to reduce demand for tobacco, bans or restrictions on advertising and promotion of tobacco products, support for cessation, and measures to reduce illicit trade.

The International Tobacco Control Policy Evaluation Project (ITC Project) was founded in 2002 as an international system for evaluating the impact of FCTC policies as they are implemented throughout the world. In 20 countries, the ITC Project is conducting parallel cohort surveys designed to evaluate the effectiveness of national-level tobacco control policies. In addition to the multi-country cohort design, which allows quasi-experimental evaluations, the choice of measures associated with each policy domain is guided by explicit mediational models formulated for each domain. These mediational models arise from existing theoretical and empirical work in each policy domain; thus, the mediational model for warnings differs from the mediational model for smoke-free laws.

This presentation will provide an overview of the ITC Project, including a discussion of these mediational models, with examples of their usefulness as well as the challenges ahead. And of course there will be discussion and stories of Mary Thompson’s enormous past and continuing contributions to the ITC Project.


Statistical Methods for Sampling Hard-to-Reach Networked Populations

Mark Handcock
University of California, Los Angeles

This talk will provide an overview of probability models and inferential methods for the analysis of data collected using Respondent Driven Sampling (RDS). RDS is a sampling technique for studying hidden and hard-to-reach populations for which effective sampling frames cannot easily be obtained. It has been widely used to sample populations at high risk of HIV infection and has also been used to survey undocumented workers and migrants. RDS avoids the explicit construction of a sampling frame by using a referral chain of dependent observations: starting with a small group of seed respondents chosen by the researcher, the study participants themselves recruit additional survey respondents by referring their friends into the study. As an alternative to frame-based sampling, the chain-referral approach employed by RDS can be extremely successful as a means of recruiting respondents.

Traditionally, estimation has relied on sampling weights obtained by treating the sampling process as a random walk on a graph, where the graph is the social network of relations among members of the target population. These estimates rest on strong assumptions that allow the sample to be treated as a probability sample. However, these assumptions are seldom viable in practice, and the resulting estimators have poor statistical properties.

We discuss new estimators which improve inference from RDS data. The first is based on a without-replacement approximation to the sampling process introduced by Gile (2011).  The second is model-assisted based on fitting an exponential-family random graph model to the social network.  We demonstrate their ability to correct for biases due to the finite population and initial convenience sample.

This is joint work with Krista J. Gile, University of Massachusetts, Amherst.
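
For context, the sketch below implements the kind of traditional inverse-degree (random-walk) weighting that the new estimators improve upon: each respondent is weighted by the inverse of their reported network degree, reflecting the assumption that inclusion probability is proportional to degree. The data are invented purely for illustration.

```python
import numpy as np

# Sketch of a traditional RDS estimator: weight each respondent by the
# inverse of their reported network degree. Illustrative only; the talk's
# estimators (the without-replacement approximation of Gile 2011 and the
# model-assisted estimator) go beyond this.

y = np.array([1, 0, 1, 1, 0, 1, 0, 0])            # outcome (e.g., infection indicator)
degree = np.array([5, 20, 8, 3, 15, 6, 30, 10])    # reported network sizes

weights = 1.0 / degree
rds_estimate = np.sum(weights * y) / np.sum(weights)
naive_mean = y.mean()
print(f"inverse-degree estimate: {rds_estimate:.3f}, unweighted mean: {naive_mean:.3f}")
```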


Efficient Tests for SNP-set/Gene-set Effects in Population-based Studies

Xihong Lin
Harvard School of Public Health

In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers assembled on the basis of biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches than the traditional marginal analysis of single markers. Statistical procedures for testing the overall effect of a set of genetic markers have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework have been proposed as powerful alternatives to the standard Rao p-degree-of-freedom score test. The advantages of these EB-based procedures are most apparent when the markers are moderately or highly correlated, owing to the reduction in degrees of freedom. We propose an adaptive score test which up- or down-weights the contribution from each member of the marker set based on the Z-scores of their effects. Such an adaptive procedure gains power over existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via chi-squared approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
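
A rough, heavily simplified sketch of the flavor of a marker-set test follows: per-marker Z-scores are combined with data-driven weights and calibrated by permutation. This illustrates only the general idea of adaptively weighting marker contributions; it is not the EB, adaptive, or omnibus test of the talk, and all data and choices below are assumptions.

```python
import numpy as np

# Toy marker-set test: compute per-marker association Z-scores, form a
# weighted sum of squares that up-weights markers with larger apparent
# effects, and calibrate the statistic by permutation. Illustrative only.

rng = np.random.default_rng(1)
n, p = 500, 20
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # genotypes for one marker set
y = rng.normal(size=n) + 0.25 * G[:, 0]                # outcome with one weak signal

def set_statistic(y, G):
    yc = y - y.mean()
    Gc = G - G.mean(axis=0)
    z = Gc.T @ yc / (np.sqrt((Gc ** 2).sum(axis=0)) * yc.std())  # rough per-marker Z-scores
    w = np.abs(z)                      # up-weight markers with stronger apparent effects
    return np.sum(w * z ** 2)

obs = set_statistic(y, G)
perm = np.array([set_statistic(rng.permutation(y), G) for _ in range(999)])
pval = (1 + np.sum(perm >= obs)) / (1 + len(perm))
print(f"set statistic = {obs:.2f}, permutation p-value = {pval:.3f}")
```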


Modeling Left-truncated and Right-censored Survival Data with Longitudinal Covariates

Jane-Ling Wang
University of California, Davis

In this talk, we explore the modeling of survival data in the presence of longitudinal covariates. In particular, we consider survival data that are subject to both left truncation and right censoring. It is well known that traditional approaches, such as the partial likelihood approach for the Cox proportional hazards model, encounter difficulties when longitudinal covariates are involved in the modeling of survival data. A joint likelihood approach has been shown in the literature to provide an effective way to overcome those difficulties for right-censored data. However, in the presence of left truncation, there are additional challenges for the joint likelihood approach. We propose an alternative likelihood to overcome these difficulties and establish the asymptotic theory, including the semiparametric efficiency, of the new approach. The approach will also be illustrated numerically.

This talk is based on joint work with Yuru Su.


Q-learning for Estimating Optimal Dynamic Treatment Rules from Observational Data

Erica Moodie
McGill University

The area of dynamic treatment regimes (DTRs) aims to make inference about adaptive, multistage decision-making in clinical practice. A DTR is a set of decision rules, one per treatment interval, each of which takes the treatment and covariate history as input and returns a recommended treatment. Q-learning is a popular method from the reinforcement learning literature that has recently been applied to estimate DTRs. While, in principle, Q-learning can be used for both randomized and observational data, the focus in the literature thus far has been exclusively on the randomized treatment setting. We extend the method to incorporate measured confounding covariates, using propensity scores and inverse probability weighting. We provide an extensive simulation study comparing different approaches to accounting for confounding in the Q-learning framework; the methods are examined under a variety of settings, including practical violations of positivity and nonregular scenarios. We illustrate the methods by examining the effect of breastfeeding on IQ in the PROBIT data.

This is joint work with Bibhas Chakraborty (Columbia University).
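
Below is a minimal two-stage Q-learning sketch in which inverse probability of treatment weights are used as regression weights at each stage; the simulated data, linear Q-functions, and weighting scheme are illustrative assumptions, not the exact procedures compared in the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Minimal two-stage Q-learning sketch with inverse probability of treatment
# weighting (IPW) to adjust for measured confounding. Illustrative only.

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)                                   # baseline covariate (confounder)
a1 = rng.binomial(1, 1 / (1 + np.exp(-x1)))               # stage-1 treatment depends on x1
x2 = 0.5 * x1 + a1 + rng.normal(size=n)                   # intermediate covariate
a2 = rng.binomial(1, 1 / (1 + np.exp(-x2)))               # stage-2 treatment depends on x2
y = x1 + x2 + a2 * (1.0 - x2) + 0.5 * a1 + rng.normal(size=n)   # final outcome

def ipw(covars, a):
    """Inverse probability of treatment weights from a logistic propensity model."""
    ps = LogisticRegression().fit(covars, a).predict_proba(covars)[:, 1]
    return np.where(a == 1, 1 / ps, 1 / (1 - ps))

# Stage 2: weighted regression of Y on (x2, a2, a2*x2); pseudo-outcome = max over a2.
X2 = np.column_stack([x2, a2, a2 * x2])
q2 = LinearRegression().fit(X2, y, sample_weight=ipw(x2.reshape(-1, 1), a2))
pred2 = lambda a: q2.predict(np.column_stack([x2, np.full(n, a), a * x2]))
pseudo_y = np.maximum(pred2(0), pred2(1))

# Stage 1: weighted regression of the pseudo-outcome on (x1, a1, a1*x1).
X1 = np.column_stack([x1, a1, a1 * x1])
q1 = LinearRegression().fit(X1, pseudo_y, sample_weight=ipw(x1.reshape(-1, 1), a1))
print("stage-1 coefficients (x1, a1, a1*x1):", q1.coef_)
```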


Causal Inference for Continuous Time Processes when Covariates are Observed only at Discrete Times

Dylan Small
University of Pennsylvania

Much work on causal inference for longitudinal data has assumed a discrete time underlying data generating process. However, in some observational studies it is more reasonable to assume that the data are generated from a continuous time process and are only observable at discrete time points. In such settings, the sequential randomization assumption in the observed discrete time data, which is essential in justifying discrete time g-estimation for causal inference with longitudinal data, may not be reasonable. We discuss other useful assumptions that guarantee the consistency of discrete time g-estimation. For more general cases in which those assumptions are violated, we propose a new method that performs at least as well as g-estimation in most scenarios and provides consistent estimation in some cases where g-estimation is severely inconsistent.


Fisher Information and Projection Pursuit

Bruce Lindsay
Pennsylvania State University

We revisit two old ideas and make some interesting new connections. In a multivariate context, the Fisher information for a location problem can be turned into a diagnostic matrix for "interesting projections". Projections onto the eigenvectors of this matrix describe the data directions with the least conditional normality. We show how one can create a practical projection pursuit methodology that is no more computationally difficult than principal components analysis. Links will be made with traditional projection pursuit as well as the more modern independent components analysis.


Likelihood Inference for Complex Problems

Nancy Reid
University of Toronto

Inference based on the likelihood function owes much to theory developed some decades ago. What is the current role of likelihood in developing strategies for the analysis of very large data sets, often of very high dimension and with complex dependencies? This talk will consider some aspects of this question, with emphasis on problems in stochastic modelling, estimating equations, and survey methodology.


Better Statistical Learning via Sequential Design or "Active Learning"

Hugh Chipman
Acadia University

In supervised learning problems, "Active Learning" refers to the iterative process of sequential data selection and model building, with the goal of building a better model while requiring fewer observations. From a statistical viewpoint, sequential design of experiments seeks to solve a similar problem, but often focuses on parametric models such as linear regression. We consider the challenge of Active Learning with very flexible models. To meet this challenge, a framework for formal statistical inference must be available for the model. We consider Bayesian Additive Regression Trees (BART), a flexible ensemble model which can deal with high dimensionality, irrelevant predictors, nonlinear relationships, interactions, and local effects. We will discuss issues involved in the development of BART as an active learning tool. Active learning for computer experiments, in which the response can be a complex but deterministic function of the inputs, will be an important special case.
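
As a toy illustration of the active-learning loop, the sketch below uses a random forest as a stand-in for BART (a BART implementation is not assumed to be available here): at each step the model is refit and the next design point is the candidate where the per-tree predictions disagree most. Everything about the example is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy active-learning loop. A random forest stands in for BART purely for
# illustration: at each iteration the model is refit and we query the
# candidate point where the spread of per-tree predictions (a rough
# uncertainty proxy) is largest.

rng = np.random.default_rng(3)
f = lambda x: np.sin(3 * x) + 0.1 * rng.normal(size=np.shape(x))   # unknown response

pool = np.linspace(0, 3, 300).reshape(-1, 1)          # candidate design points
idx = list(rng.choice(len(pool), 5, replace=False))   # small initial design
y = list(f(pool[idx, 0]))

for step in range(10):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(pool[idx], y)
    per_tree = np.stack([t.predict(pool) for t in model.estimators_])
    uncertainty = per_tree.std(axis=0)
    uncertainty[idx] = -np.inf                         # never re-pick existing points
    new = int(np.argmax(uncertainty))
    idx.append(new)
    y.append(float(f(pool[new, 0])))

print("design points chosen:", np.round(pool[idx, 0], 2))
```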


The Lasso: Some Novel Algorithms and Applications

Robert Tibshirani
Stanford University

I will discuss some procedures for modelling high-dimensional data, based on L1 (lasso)-style penalties. I will describe pathwise coordinate descent algorithms for the lasso, which are remarkably fast and facilitate application of the methods to very large datasets for the first time. I will then give examples of new applications of L1 penalties to microarray classification, the fused lasso for signal detection, and the matrix completion problem.
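
For readers unfamiliar with the algorithm, a stripped-down sketch of coordinate descent for the lasso with standardized predictors is given below; it uses the standard soft-thresholding update but omits the warm starts along a penalty path, active-set strategies, and other refinements that make the production implementations so fast.

```python
import numpy as np

# Plain coordinate descent for the lasso with standardized predictors:
# cycle over coefficients, applying the soft-thresholding update to each
# partial residual. A stripped-down sketch of the idea only.

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1, assuming unit-variance columns."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]      # partial residual excluding j
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam)
    return beta

rng = np.random.default_rng(4)
n, p = 200, 50
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)                          # standardize columns
beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(size=n)
y = y - y.mean()

beta_hat = lasso_cd(X, y, lam=0.1)
print("indices of nonzero coefficients:", np.flatnonzero(beta_hat))
```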