Statistical Models in S
edited by John M. Chambers and Trevor J. Hastie
1992, 608 pages, ISBN 0-534-16765-9
Wadsworth and Brooks/Cole Advanced Books and Software

Contents:
 1. An Appetizer (by J.M. Chambers, T.J. Hastie)
 2. Statistical Models (by J.M. Chambers, T.J. Hastie)
 3. Data for Models (by J.M. Chambers)
 4. Linear Models (by J.M. Chambers)
 5. Analysis of Variance; Designed Experiments (by J.M. Chambers, A.E. Freeny, R.M. Heiberger)
 6. Generalized Linear Models (by T.J. Hastie, D. Pregibon)
 7. Generalized Additive Models (by T.J. Hastie)
 8. Local Regression Models (by W.S. Cleveland, E. Grosse, W.M. Shyu)
 9. Tree-Based Models (by L.A. Clark, D. Pregibon)
10. Nonlinear Models (by D.M. Bates, J.M. Chambers)
Appendix A. Classes and Methods: Object-Oriented Programming in S (by J.M. Chambers)
Appendix B. S Functions and Classes

New programming functionality has been added to the ``New S'' language since the publication of the ``New S'' manual in 1988 (The New S language: A programming environment for data analysis and graphics, by R.A. Becker, J.M. Chambers, and A.R. Wilks). By using this extension to the ``New S'' language, the ten authors of this manual are able to develop a unified approach to the fitting and analysis of a fairly complete collection of response models (traditional and recent). The book is highly recommended to anyone interested in using New S (or is it New-New S?), in applying more recent response models, or in research in statistical computing.

The editors are to be congratulated for a surprisingly well organized book (given ten authors were involved). Each chapter is organized into four primary sections. The first describes the statistical methodology, the second how to use the S functions and data structures, the third how to extend or specialize the given software, and the fourth contains more detail on the computations.
Consequently, by reading only the first two primary sections of each chapter one comes away well equipped to use the software in a host of applications. For most readers this will be enough. As interest and circumstance demand, the remaining sections of any chapter can be read with profit. Although early chapters are required reading for later chapters, chapters 7 through 10 can be read independently of one another.

The only problem with an otherwise excellent piece of work is the software model on which the statistical models are based. ``Object-oriented programming in S'' is unlike any object-oriented programming language I know. At best, New S's functional programming style has been extended so that the user can write functions which dispatch to other functions depending only on the value of the ``class'' attribute of one of their arguments. True, this makes it possible to write functions which are ``generic'', but it is a far cry from object-oriented programming.

Despite the impression given to the casual reader, there is no such thing as a ``class'' in this New S; there is merely an attribute called ``class'' which can appear on any S data structure. The generic functions look to this attribute to decide which one of a collection of New S functions (called methods) to invoke. The dispatching is often called ``method lookup'' and in the New S model is confused with the definition of a class.

In an object-oriented programming language, classes are data structures which can themselves be manipulated. Minimally, they can be related one to another through the notion of inheritance. As an example, consider using classes as data structures to describe birds. We might define a general class called ``bird'' which would be a template data structure representing the properties held by birds in general. A second class called ``flightless-bird'' could be introduced to represent birds which have evolved to a flightless state (e.g. penguins and ostriches).
It is clear that every element of the class ``flightless-bird'' is also an element of the class ``bird''. It is also clear that the converse does not hold; an element of the class ``bird'' is not necessarily also an element of the class ``flightless-bird''. This distinction is reflected in the software by asserting that ``flightless-bird'' is a subclass of ``bird''. Consequently any property of ``bird'' is inherited by ``flightless-bird''. If I had a pet ostrich called Frank, he would be represented in this system as an ``instance'' of the class ``flightless-bird'' -- a ``flightless-bird object''.

A generic function that operated on birds might be ``fly'', which would cause the bird-object to fly from its present position to a new specified position. If applied to the object representing Frank, however, nothing should happen because Frank is a ``flightless-bird''. This is implemented in software by defining a generic function called ``fly'' and separate ``fly'' methods for each of the classes ``bird'' and ``flightless-bird''. The method lookup procedure typically traverses the inheritance hierarchy of the classes to determine which is the most specific method for a given argument to the generic function call. (In some systems, this lookup can be redefined.)

In the extended New S system, Frank would be represented by making a ``bird'' data structure and pushing the string ``flightless-bird'' onto its class attribute vector. No class called ``flightless-bird'' would exist as a data structure. A separate New S function, fly.flightless-bird, would be defined to represent the fly method for flightless-birds. So far so good.

In an object-oriented programming language, extending a system developed using classes and generic functions is simple and powerful. Should I wish to isolate the properties and behaviours which distinguish migratory birds, I do so by defining a class, ``migratory-birds'', as a subclass of ``bird'' distinct from the ``flightless-bird'' class.
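The bird example can be sketched in a class-based language. What follows is a minimal illustration in Python rather than S; every name in it (Bird, FlightlessBird, MigratoryBird, the position and winter_home fields) is hypothetical and chosen only to mirror the description above.

```python
class Bird:
    """Template for the properties held by birds in general."""

    def __init__(self, name, position=(0, 0)):
        self.name = name
        self.position = position

    def fly(self, destination):
        # A bird in general moves from its present position to the new one.
        self.position = destination
        return self.position


class FlightlessBird(Bird):
    """Subclass of Bird: inherits everything, but specializes fly."""

    def fly(self, destination):
        # Nothing happens; the position is left unchanged.
        return self.position


class MigratoryBird(Bird):
    """Another subclass of Bird, distinct from FlightlessBird."""

    def __init__(self, name, position=(0, 0), winter_home=None):
        super().__init__(name, position)
        self.winter_home = winter_home  # hypothetical extra data field


# Frank the ostrich is an instance of the class FlightlessBird.
frank = FlightlessBird("Frank")
```

Here the classes are themselves objects that can be inspected (for instance, FlightlessBird.__mro__ lists the inheritance chain that method lookup traverses), and MigratoryBird defines no fly method of its own.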
I need not re-implement any methods and data-fields for migratory birds which are defined for birds in general.

Consider now a statistical example -- implementing generalized linear models (glm) and standard linear models (lm). Because a linear model is really a special kind of generalized linear model, one might naturally define two classes, say ``lm'' and ``glm'', and assert that ``lm'' is a specialized subclass of ``glm''. As a consequence, whatever property one expects of a ``glm'' would also be found, through inheritance, on an ``lm'', since it is simply a special kind of ``glm''.

Many object-oriented systems have long since left the Smalltalk-80 model, where a generic function can dispatch on the type of only one of its arguments, and now permit dispatching to depend on the types of any number of arguments. The Common Lisp Object System is an example of one such system. It is difficult to see how New S could be extended yet again to accommodate this kind of method lookup.

There are no classes in this extension of New S. While it is claimed throughout that object-oriented programming is used, this is not so.

The standard response models of the classic linear model (lm) and the generalized linear model (glm) (including quasi-likelihood models) provide the basic intuition for the design of the unified approach. Newer methodologies like tree-based models for classification and regression, local regression models (loess), and generalized additive models (gam) are treated in similar fashion. Here ``similar fashion'' is an understatement; any common elements of the analysis in these response models are enforced by the design of the software.
For example, with the exception of the non-linear model, the fitting procedure of any response model accepts an extended version of the Wilkinson and Rogers notation for specifying the structural part of the model (1973, Applied Statistics, 22, pp 392-399). (A non-linear model must explicitly define its parameters.) While the commonality is emphasised, specialized treatment in specific circumstances is encouraged. For example, to ensure the correct analysis of variance for some experimental designs, the formula specification is extended to allow identification of different error sources for analysis of variance data structures (aov).

Common and specialized behaviour for different response models is easily specified programmatically through the two extensions to the New S language described in this book. The first is given by the twin notions of generic functions and specialized methods. As an example, consider the function ``anova''. As its first argument, it takes a fitted ``model'' data structure and produces an anova-style table summarizing the fitted model. ``Anova'' should (and does) work for any aov, lm, glm, gam, or loess fit. By this it is meant that there is some sense in which we would like to produce an anova-like table for any of these fits. Yet what should be produced will depend on the kind of data structure given as its first argument. If, for example, a glm fit is given, then an appropriate ``analysis of deviance'' table is printed. This specialization is achieved by having the ``anova'' function automatically dispatch to the function ``anova.glm'' whenever it is presented with a glm fitted model. Here ``anova'' is a generic function and ``anova.glm'' one of its specialized methods.

The second extension allows arbitrary S data structures to be related to one another through some kind of inheritance. This is implemented by adding a new attribute called ``class'' on S data structures.
For example, a glm fitted model will have as its class attribute the vector (in New S terminology) given by <"glm", "lm">. Operationally this means that any generic function (e.g. ``anova'') that is called on a glm will look first for a function of the same name but ending in ``.glm'' (e.g. ``anova.glm'') to apply to the argument. If there is one then it is used. If there is not, it looks again but this time for one ending in ``.lm'' (e.g. ``anova.lm''). If the entire vector of class attributes fails to turn up an appropriate function, then finally the ending ``.default'' is tried (e.g. ``anova.default'') -- there may or may not be a ``.default'' method defined.

On the surface, these two extensions seem to endow S with some of the principal features that have come to be associated with the phrase ``object-oriented programming'' (oo-programming for short). Indeed, the book is strewn throughout with the common terminology of oo-programming (e.g. classes, generic functions, methods and the like). The reader should not be misled. ``Object-oriented programming in S'' as described in Appendix A is at best a poor cousin to what is generally understood to be oo-programming. In oo-programming languages with which I have worked (Smalltalk-80, LOOPS, CLOS) classes exist as data structures in their own right. They can be manipulated, instantiated and ...

THIS IS GOING TO BE FAIRLY TECHNICAL AND NOT A LITTLE NEGATIVE.

The difficulty I have with the book (and hence the software) is its view of object-oriented programming and the consequences this view has had on the design of the statistical software.

1. Classes do not exist in their own right; there is no relationship between classes.

DINDE, Arizona, and Quail are what are known as class-based systems. In such systems a distinction is drawn between a class data structure and the data structure called an instance of that class; the former is a template for the latter. Classes are related one to another through inheritance.
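To make the point about classes not existing in their own right concrete, the lookup rule described above (try each entry of the class vector in turn, then ``.default'') can be imitated in a few lines. This is a rough sketch in Python, not S code and not the book's implementation; the registry METHODS, the helpers method and dispatch, and the generic ``describe'' are all hypothetical names introduced for illustration.

```python
# A "fit" is an ordinary dict; its "class" entry is just a list of strings.
METHODS = {}  # maps names such as "anova.glm" to plain functions


def method(name):
    """Register a plain function under an S-style name like 'anova.glm'."""
    def register(fn):
        METHODS[name] = fn
        return fn
    return register


def dispatch(generic, obj):
    """Try generic.<class> for each class string in order, then generic.default."""
    for cls in list(obj.get("class", [])) + ["default"]:
        fn = METHODS.get(generic + "." + cls)
        if fn is not None:
            return fn(obj)
    raise LookupError("no method found for " + generic)


def anova(fit):
    return dispatch("anova", fit)


def describe(fit):  # hypothetical generic with no glm-specific method
    return dispatch("describe", fit)


@method("anova.lm")
def anova_lm(fit):
    return "analysis of variance table"


@method("anova.glm")
def anova_glm(fit):
    return "analysis of deviance table"


@method("describe.lm")
def describe_lm(fit):
    return "a linear-model fit"


lm_fit = {"class": ["lm"]}
glm_fit = {"class": ["glm", "lm"]}
tree_fit = {"class": ["tree"]}  # no anova method and no default registered
```

Calling anova(glm_fit) reaches the function registered as anova.glm, while describe(glm_fit) falls through to describe.lm only because the string "lm" sits second in the list. Nothing in the program corresponds to a class that could be inspected or related to another class; the ``inheritance'' lives entirely in the order of strings attached to each individual object.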
In contrast to class-based systems, Lisp-Stat is a prototype- or exemplar-based system. That means there is no distinction between instances and classes -- any object can inherit properties and behaviour from any other object. New S is purportedly a class-based system but more accurately is