Tyranny in the Name of Science

Statistics Canada, our nation's top statistics bureau, is reportedly seeking the power to make all of its surveys, not just the long-form census, mandatory. Even as an academic statistician, I am alarmed by this new agenda and its implications for our civil liberties.

We often teach our students in introductory classes that there are two fundamentally different types of statistical studies: randomized experiments and observational studies. While it is true that one can draw much stronger conclusions from the former, there are circumstances where one must rely on the latter despite its inherent limitations.

The most elementary (and widely used) example is the question of whether smoking increases the risk of cancer. In an ideal randomized experiment, subjects would be randomly assigned to smoke (or not) for a number of years. After the experiment is over, we would then compare the rate of cancer among those who had been forced to smoke with the rate among those who had not. In the classroom, it is almost always immediately clear to all the students that such a randomized experiment would be highly unethical and hence impractical to conduct. And that's why almost all smoking studies are observational studies. Some people choose to smoke while others don't, and we merely observe that the rate of cancer differs between these two groups of people. The inherent limitation lies in the possibility that there could be other significant differences between these two groups, and that those differences, rather than smoking itself, are responsible for the cancer (or the lack thereof). For example, there could be unknown genetic factors, or a certain bacterial composition in one's gut, or whatever, that both predispose one to smoke and elevate one's chance of developing cancer.
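
For readers who prefer to see this limitation in code, here is a minimal simulation sketch. Everything in it is hypothetical: the hidden "gene", its 30% prevalence, and all the probabilities are made-up numbers chosen only for illustration, not estimates from any real study. In this toy world, smoking has no effect on cancer at all; the hidden factor alone raises both the chance of smoking and the chance of cancer.

```python
import random

def simulate(n=200_000, seed=0):
    """Simulate a toy population in which a hidden factor drives both
    smoking and cancer, while smoking itself has NO effect on cancer.
    All probabilities below are hypothetical, for illustration only."""
    rng = random.Random(seed)
    # counts[gene][smokes] = [number of people, number of cancers]
    counts = {g: {s: [0, 0] for s in (False, True)} for g in (False, True)}
    for _ in range(n):
        gene = rng.random() < 0.30                        # hidden factor
        smokes = rng.random() < (0.60 if gene else 0.20)  # gene -> smoking
        cancer = rng.random() < (0.15 if gene else 0.05)  # gene -> cancer
        cell = counts[gene][smokes]
        cell[0] += 1
        cell[1] += int(cancer)
    return counts

def cancer_rate(cells):
    """Pooled cancer rate over one or more [people, cancers] cells."""
    people = sum(c[0] for c in cells)
    cancers = sum(c[1] for c in cells)
    return cancers / people

counts = simulate()

# What an observational study sees when the hidden factor is unrecorded:
# smokers versus non-smokers, pooled over the hidden factor.
naive_gap = (cancer_rate([counts[g][True] for g in (False, True)])
             - cancer_rate([counts[g][False] for g in (False, True)]))

# What we would see if we could compare like with like.
gap_with_gene = cancer_rate([counts[True][True]]) - cancer_rate([counts[True][False]])
gap_without_gene = cancer_rate([counts[False][True]]) - cancer_rate([counts[False][False]])

print(f"naive smoker vs non-smoker gap: {naive_gap:.4f}")
print(f"gap among those with the gene:  {gap_with_gene:.4f}")
print(f"gap among those without it:     {gap_without_gene:.4f}")
```

The naive comparison shows smokers with a clearly higher cancer rate, yet within each group defined by the hidden factor the gap nearly vanishes. The simulation says nothing about real smoking; it only shows why an observed difference, by itself, cannot rule out this kind of explanation.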

Such possibilities inevitably weaken any conclusion that one can draw from observational studies, but this is a limitation that we have all learned to live with. Conducting a randomized experiment on the effect of smoking would be tyranny in the name of science.

Mandatory data collection presumably does not go so far as to put our physical health at risk, but should there be limits on what a government (or one of its agencies) can legally force its citizens to disclose about their private lives? This is a question better answered by a political scientist or a philosopher than by a statistician like myself. But if I had to guess, I would say that they would agree the answer should be an unequivocal "yes", and that they would disagree only on the extent of these limits.

While I will let them debate the extent of these limits, I want to state here what I think is the proper attitude statisticians should have. We have learned that, sometimes, we must live with the scientifically inconvenient limitations of observational studies, so why can't we also learn to live with the limitation that a democratic government should not be permitted to collect any data it wants about its citizens? Just as we cannot always run a randomized experiment simply because doing so would allow us to better answer a scientific question, so we shouldn't insist on collecting any data we want simply in the name of science (or better policy, or whatever).

Over the years, many statisticians have advanced the technology of analyzing observational data to allow us to draw stronger conclusions from them. Admittedly, there is still a lot to be done on that front and, while meticulous research will allow us to reduce the gap between what we can conclude from an observational study and what we can conclude from an experimental one, we probably cannot eliminate it completely. Oh well, c'est la vie! The fact that we cannot completely eliminate the gap still does not justify starting a randomized experiment on smoking.

I think that, in the current debate, it is much more productive, and more appropriate, for statisticians to focus on developing new technologies to deal with the lack of perfect data than to behave like spoiled three-year-olds and demand that we be given what we want so that we can perform our idealized analyses. And there is plenty of evidence to suggest that it is possible for us to succeed in this regard.

For example, a few years ago, a team from Google, Inc. shook the data-analytic world a bit by showing us that, by monitoring what people were looking up in Google's search engine, they could predict a flu epidemic better than the U.S. Centers for Disease Control and Prevention (CDC) could. Later it was noticed that Google's predictions seemed somewhat biased, precisely because their data (what ordinary people were searching for) were not the cleanest kind that one would have liked to analyze from a purely scientific point of view. However, this does not mean their original idea is doomed. Recently, a group of Harvard statisticians has shown that the apparent bias can be corrected to a large extent by combining the almost real-time data from Google's search engine with the somewhat time-lagged data from the CDC, and by using slightly more sophisticated statistical techniques that take into account the time-series nature of the data.

Just like everybody else, we statisticians have to live in the real world, not an idealized one. We would do better to channel our professional energies into finding the best solutions to our problems under various practical, moral and ethical constraints than into insisting that society must feed our ego so that we can always deliver the "obvious", unconstrained optimal solution. We can do our work without infringing on the privacy and freedom of our fellow citizens.

Mu Zhu, PhD
Professor, Department of Statistics and Actuarial Science
University of Waterloo
Waterloo, Ontario, Canada

Original Composition: July 27, 2016 | This Version: July 28, 2016