Subset selection in regression: the bad news
DM STAT-1 CONSULTING BRUCE RATNER, PhD
574 Flanders Drive North Woodmere, NY 11581 [email protected]
516.791.3544 fax 516.791.5075 1 800 DM STAT-1 www.dmstat1.com
The first half of the following material is copyrighted material, belonging to Bruce Ratner, as found in his book Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data, CRC Press, Boca Raton, 2003. Neither the above titled book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the author. Used with permission.
Stepwise is a Problematic Method for Variable Selection in Regression:
Variable selection in regression, identifying the best subset among many variables to include in a model, is arguably the hardest part of model building. Many variable selection methods exist. Many statisticians know them, but few know that they produce poorly performing models. The wanting variable selection methods are a miscarriage of statistics because they were developed by debasing sound statistical theory into a misguided pseudo-theoretical foundation. The purpose of this article is two-fold: 1) To review five widely used variable selection methods, among which stepwise is arguably the most popular, itemize some of their weaknesses, and answer why they are used. 2) To compile a list of reliable alternative methods that are unproblematic for variable selection in regression.

I. Background

Classic statistics dictates that the statistician set about dealing with a given problem with a pre-specified procedure designed for that problem. For example, the problem of predicting a continuous target variable (e.g., profit) is solved by using the ordinary least squares (OLS) regression model along with checking the well-known underlying OLS assumptions. At hand there are several candidate predictor variables, which makes it a workable task for the statistician to check assumptions (e.g., predictor variables are linearly independent). Likewise, the dataset has a practicable number of observations, which also makes it a workable task for the statistician to check assumptions (e.g., the errors are uncorrelated). As well, the statistician can perform the regarded yet often-discarded exploratory data analysis (aka EDA), such as examining and applying the appropriate remedies for individual records that contribute to sticky data characteristics (e.g., gaps, clumps, and outliers). Quite important, EDA allows the statistician to assess whether or not a given variable, say, X, needs a transformation/re-expression (e.g., log(X), sin(X) or 1/X). The traditional variable selection methods cannot do such transformations or a priori construction of new variables from the original variables. [1.1] The inability to construct new variables is a serious weakness of the variable selection methodology.

Nowadays, building an OLS regression model or a logistic regression model (LRM, where the target variable is binary: yes-no/1-0) is problematic because of the size of the dataset. Modelers work on big data consisting of a teeming multitude of variables and an army of observations. The workable tasks are no longer feasible. Modelers cannot sure-footedly use OLS regression and LRM on big data, as the two statistical regression models were conceived, tested, and experimented with in the small-data setting of the day, over 50 and 205 years ago for LRM and OLS regression, respectively. The regression theoretical foundation and the tool of significance testing employed on big data are without statistical binding force. Thus, fitting big data to a pre-specified small-framed model produces a skewed model with doubtful interpretability and questionable results.

Folklore has it that the knowledge and practice of variable selection methods were developed when small data slowly grew into the early size of big data circa late 1960s/early 1970s. With only a single bibliographic citation ascribing variable selection methods to unsupported notions, I believe a reasonable scenario of the genesis of the methods was as follows: College statistics nerds (intelligent thinkers) and computer science geeks (intelligent doers) put together the variable selection methodology using a trinity of selection-components:
1) Statistical tests (e.g., F, chi-square, and t tests, and significance testing)
2) Statistical criteria (e.g., R-squared, adjusted R-squared, Mallows’ Cp, and MSE [3.1])
3) Statistical stopping rules (e.g., p-value flags for variable entry/deletion/staying in a model)
The created body of unconfirmed thinking about the newly developed variable selection methods took root in a soil of expertise and adroitness in computer-automated, misguided statistics. The trinity distorts its components’ original theoretical and inferential meanings when they are framed within the newborn methods. The statistician executing the computer-driven trinity of statistical apparatus in a seemingly intuitive, insightful way gave proof (face validity) that the problem of variable selection, aka subset selection, was solved (at least to the uninitiated statistician).

The newbie subset selection methods initially enjoyed wide acceptance with extensive use, and presently still do. Statisticians build models whose accuracy and stability are at risk, either unknowingly using these unconfirmed methods or knowingly exercising them because they know not what to do. It was not long before these methods’ weaknesses, some contradictory, generated many commentaries in the literature. I itemize nine ever-present weaknesses, below, for two of the traditional variable selection methods, All-subset and Stepwise. I concisely describe the five frequently used variable selection methods in the next section.
1. For All-subset selection with more than 40 variables:
a. The number of possible subsets can be huge.
b. Often, there are several good models, although some are unstable.
c. The best X variables may be no better than random variables, if the sample size is small relative to the number of all variables.
d. The regression statistics and regression coefficients are biased.
2. Stepwise selection regression can yield models that are too small.
3. Why the number of candidate variables and not the number in the final model is the number of degrees of freedom to consider.
4. The data analyst knows more than the computer … and failure to use that knowledge
5. Stepwise selection yields confidence limits that are far too narrow.
6. Regarding frequency of obtaining authentic and noise variables … The degree of correlation among the predictor variables affected the frequency with which authentic predictor variables found their way into the final model. The number of candidate predictor variables affected the number of noise variables that gained entry to the model.
7. Stepwise selection will not necessarily produce the best model if there are redundant
8. There are two distinct questions here: (a) When is Stepwise
9. As to question (b) above … there are two groups that are inclined to favor its usage. One consists of individuals, with little formal training in data analysis, who confuse knowledge of data analysis with knowledge of the syntax of SAS, SPSS, etc. They seem to figure that if it's there in a program, it's gotta be good and better than actually thinking about what my data might look like. They are fairly easy to spot and to condemn in a right-thinking group of well-trained data analysts. However, there is also a second group who is often well trained …. They believe in statistics … given any properly obtained database, a suitable computer program can objectively make substantive inferences without active consideration of the underlying hypotheses. … Stepwise selection is the parent of this line of blind data analysis
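To give a sense of scale for weakness 1a, the number of non-empty subsets of k candidate variables is 2^k - 1. The short Python sketch below tabulates that count; the function names are illustrative only and do not come from any statistical package.

```python
from math import comb

def n_subsets(k: int) -> int:
    """Number of non-empty subsets (candidate models) of k variables."""
    return 2 ** k - 1

def n_subsets_of_size(k: int, m: int) -> int:
    """Number of distinct m-variable models drawn from k candidates."""
    return comb(k, m)

if __name__ == "__main__":
    # Print the model count for a few candidate-pool sizes.
    for k in (10, 20, 40):
        print(f"{k} candidates: {n_subsets(k):,} possible models")
```

With 40 candidates the count already exceeds one trillion, which is why an exhaustive search is a heavy computation.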
Currently, there is burgeoning research that continues the original efforts of subset selection by shoring up its pseudo-theoretical foundation. It follows a line of examination that adds assumptions and makes modifications for eliminating the weaknesses. As the traditional methods are being mended, there are innovative approaches with starting points far afield from their traditional counterparts. There are freshly minted methods, like the enhanced variable selection method built into the GenIQ Model, constantly being developed.

II. Introduction

Variable selection in regression, identifying the best subset among many variables to include in a model, is arguably the hardest part of model building. Many variable selection methods exist because variable selection provides a solution to one of the most important problems in statistics. Many statisticians know them, but few know that they produce poorly performing models. The wanting variable selection methods are a miscarriage of statistics because they were developed by debasing sound statistical theory into a misguided pseudo-theoretical foundation. They are executed with computer-intensive search heuristics guided by rules of thumb. Each method uses a unique trio of elements, one from each component of the trinity of selection-components. Different sets of elements typically produce different subsets. The number of variables in common among the different subsets is small, and the sizes of the subsets can vary considerably.
An alternative view of the problem of variable selection is to examine certain subsets and select
the best subset, which either maximizes or minimizes an appropriate criterion. Two subsets are
obvious – the best single variable and the complete set of variables. The problem lies in selecting
an intermediate subset that is better than both of these extremes. Therefore, the issue is how to find
the necessary variables among the complete set of variables by deleting both irrelevant variables
(variables not affecting the dependent variable) and redundant variables (variables not adding
anything to the dependent variable).
I review five frequently used variable selection methods. These everyday methods are found in
major statistical software packages. The test statistic for the first three methods is either the
F statistic for a continuous dependent variable, or the G statistic for a binary dependent variable.
The test statistic for the fourth method is either R-squared for a continuous dependent variable, or
the Score statistic for a binary dependent variable. The last method uses one of the criteria:
R-squared, adjusted R-squared, Mallows’ Cp.
1. Forward Selection (FS) - This method adds variables to the model until no remaining variable (outside the model) can add anything significant to the dependent variable. FS begins with no variable in the model. For each variable, the test statistic (TS), a measure of the variable’s contribution to the model, is calculated. The variable with the largest TS value that is greater than a preset value C is added to the model. Then the test statistic is calculated again for the variables still remaining, and the evaluation process is repeated. Thus, variables are added to the model one by one until no remaining variable produces a TS value that is greater than C. Once a variable is in the model, it remains there.
2. Backward Elimination (BE) - This method deletes variables one by one from the model until all remaining variables contribute something significant to the dependent variable. BE begins with a model that includes all variables. Variables are then deleted from the model one by one until all the variables remaining in the model have TS values greater than C. At each step, the variable showing the smallest contribution to the model (i.e., with the smallest TS value that is less than C) is deleted.
3. Stepwise (SW) - This method is a modification of the forward selection approach and differs in that variables already in the model do not necessarily stay. As in Forward Selection, SW adds variables to the model one at a time. Variables that have a TS value greater than C are added to the model. After a variable is added, however, SW looks at all the variables already included to delete any variable that does not have a TS value greater than C.
4. R-squared (R-sq) - This method finds several subsets of different sizes that best predict the dependent variable. R-sq finds subsets of variables that best predict the dependent variable based on the appropriate TS. The best subset of size k has the largest TS value. For a continuous dependent variable, TS is the popular measure R-squared, the coefficient of multiple determination, which measures the proportion of the variance in the dependent variable explained by the multiple regression. For a binary dependent variable, TS is the theoretically correct but less-known Score statistic. R-sq finds the best one-variable model, the best two-variable model, and so forth. However, it is unlikely that one subset will stand out as clearly being the best, as TS values are often bunched together. For example, they are equal in value when rounded at, say, the third place after the decimal point. R-sq generates a number of subsets of each size, which allows the user to select a subset, possibly using nonstatistical conditions.
5. All-possible Subsets - This method builds all one-variable models, all two-variable models, and so on, until the last all-variable model is generated. The method requires a powerful computer (because a lot of models are produced), and selection of any one of the criteria: R-squared, adjusted R-squared, Mallows’ Cp.
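The Forward Selection and Stepwise descriptions above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation found in any statistical package. It assumes NumPy is available, a continuous dependent variable, the partial F statistic as the TS, and a single preset value C for both entry and deletion. The function names and the `max_steps` guard (added because one threshold for entry and deletion can, in principle, cycle) are hypothetical choices made for this sketch.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on columns `cols` plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def partial_f(X, y, model, j):
    """Partial F statistic (the TS) for variable j, given the other variables in `model`."""
    full = sorted(set(model) | {j})
    reduced = [c for c in full if c != j]
    rss_full = rss(X, y, full)
    df_resid = len(y) - len(full) - 1
    return (rss(X, y, reduced) - rss_full) / (rss_full / df_resid)

def forward_selection(X, y, C=4.0):
    """Add the variable with the largest TS while that TS exceeds C; never delete."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        ts = {j: partial_f(X, y, selected, j) for j in remaining}
        best = max(ts, key=ts.get)
        if ts[best] <= C:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

def stepwise(X, y, C=4.0, max_steps=None):
    """Forward step as above, then delete any earlier variable whose TS is not above C."""
    n_cols = X.shape[1]
    max_steps = 2 * n_cols if max_steps is None else max_steps
    selected = []
    for _ in range(max_steps):
        remaining = [j for j in range(n_cols) if j not in selected]
        if not remaining:
            break
        ts = {j: partial_f(X, y, selected, j) for j in remaining}
        best = max(ts, key=ts.get)
        if ts[best] <= C:
            break
        selected.append(best)
        # Deletion pass over the variables that were already in the model.
        for j in [v for v in selected if v != best]:
            if partial_f(X, y, selected, j) <= C:
                selected.remove(j)
    return selected

if __name__ == "__main__":
    # Synthetic data: only columns 0 and 2 truly affect y.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
    print("forward:", forward_selection(X, y))
    print("stepwise:", stepwise(X, y))
```

Backward Elimination follows the same pattern in reverse: start from all columns and repeatedly drop the variable with the smallest TS while that TS is below C.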
III. Weakness in the Stepwise

An ideal variable selection method for regression models would find one or more subsets of variables that produce an optimal model. [22.1] Its objectives are that the resultant models be accurate, stable, parsimonious, and interpretable, and that they avoid bias in drawing inferences. Needless to say, the above methods do not satisfy most of these goals. Each method has at least one drawback specific to its selection criterion. In addition to the nine weaknesses mentioned above, I itemize a compiled list of weaknesses of the most popular Stepwise method. [23]
1. It yields R-squared values that are badly biased high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the
3. The method yields confidence intervals for effects and predicted values that are falsely narrow
4. It yields p-values that do not have the proper meaning and the proper correction for them is
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g., F tests) that were intended to be used to test pre-specified hypotheses
8. Increasing the sample size doesn't help very much.
9. It allows us to not think about the problem.
11. The number of candidate predictor variables affected the number of noise variables that
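The claims that selection biases R-squared high and that a large pool of candidates lets noise variables into the model can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not a reproduction of any cited study: it assumes NumPy, draws a dependent variable that is unrelated to every candidate, and uses a greedy forward pick by R-squared as a stand-in for a stepwise search. All function names and the default sizes (100 observations, 50 candidates, 5 picks) are hypothetical.

```python
import numpy as np

def r_squared(X, y, cols):
    """R-squared of an OLS fit of y on columns `cols` of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    centered = y - y.mean()
    return 1.0 - float(resid @ resid) / float(centered @ centered)

def greedy_pick(X, y, k):
    """Pick k columns one at a time, each time taking the column that raises R-squared most."""
    chosen = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(rest, key=lambda j: r_squared(X, y, chosen + [j]))
        chosen.append(best)
    return chosen

def simulate(n=100, n_candidates=50, k=5, reps=20, seed=0):
    """Average R-squared of k selected vs. k pre-specified pure-noise predictors."""
    rng = np.random.default_rng(seed)
    picked, fixed = [], []
    for _ in range(reps):
        X = rng.normal(size=(n, n_candidates))
        y = rng.normal(size=n)  # y is unrelated to every column of X
        picked.append(r_squared(X, y, greedy_pick(X, y, k)))
        fixed.append(r_squared(X, y, list(range(k))))
    return float(np.mean(picked)), float(np.mean(fixed))

if __name__ == "__main__":
    picked, fixed = simulate()
    print(f"mean R-squared, selected noise variables:      {picked:.3f}")
    print(f"mean R-squared, pre-specified noise variables: {fixed:.3f}")
```

Because every predictor is noise, any R-squared above the pre-specified baseline is an artifact of the search itself.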
I add to the tally of weaknesses by stating common weaknesses in regression models, as well as those specifically related to the OLS regression model and LRM:

The everyday variable selection methods in regression typically result in models having too many variables, an indicator of overfitting. The prediction errors, which are inflated by outliers, are not stable. Thus, model implementation results in unsatisfactory performance. For ordinary least squares regression, it is well known that in the absence of normality or linearity, or in the presence of outliers in the data, variable selection methods perform poorly. For logistic regression, the computer-automated variable-selection models are unstable and not reproducible. The variables selected as predictor variables in the models are sensitive to unaccounted-for sample variation in the data.
Given the litany of weaknesses cited, the lingering question is: Why do statisticians use variable selection methods to build regression models? To paraphrase Mark Twain: “Get your [data] first, and then you can distort them as you please.” [23.1] The author’s answer is: “Modelers use variable selection methods every day because they can.”

As a counterpoint to the absurdity of “because they can,” I enliven anew Tukey’s solution of the Natural Seven-step Cycle of Statistical Modeling and Analysis for defining a substantially performing regression model. I feel that newcomers to Tukey’s EDA need the Seven-step Cycle introduced within the narrative of Tukey’s analytic philosophy. Accordingly, I enfold the solution with front and back matter: The Essence of EDA, and The EDA School of Thought, respectively. I delve into the trinity of Tukey’s masterwork; but first I discuss, below, an enhanced variable selection method, for which I might be the only exponent for appending this method to the current baseless arsenal of variable selection.

IV. Enhanced Variable Selection Method

In lay terms, the variable-selection problem in regression can be stated:
Find the best combination of the original variables to include in a model.

The variable selection method neither states nor implies that it has an attribute to concoct new variables stirred up by mixtures of the original variables. The attribute, data mining, is either overlooked, perhaps because it is reflective of the simple-mindedness of the problem-solution at the onset, or is currently sidestepped as the problem is too difficult to solve. A variable selection method without a data mining attribute obviously hits a wall; getting beyond that wall would increase the predictiveness of the technique. In today’s terms, the variable selection methods are without data mining capability. They cannot dig the data for the mining of potentially important new variables. (This attribute, which has never surfaced during my literature search, is a partial mystery to me.) Accordingly, I put forth a definition of an enhanced variable selection method:

An enhanced variable selection method is one that identifies a subset that consists of the original variables and data-mined variables, whereby the latter are a result of the data-mining attribute of the method itself.
The following five discussion-points clarify the attribute-weakness, and illustrate the concept of an enhanced variable selection method.
1. Consider the complete set of variables, X1, X2, ..., X10. Any of the current variable selection methods in use finds the best combination of the original variables (say X1, X3, X7, X10); but it can never automatically transform a variable (say, transform X1 to log X1) if that were needed to increase the information content of that variable. Furthermore, none of the methods can generate a re-expression of the original variables (perhaps X3/X7) if the constructed variable, or structure, were to offer more predictive power than the original component variables combined. In other words, current variable selection methods cannot find an enhanced subset, which needs, say, to include transformed and re-expressed variables (possibly X1, X3, X7, X10, logX1). A subset of variables without the potential of new structure offering more predictive power clearly limits the modeler in building the best model.
2. Specifically, the current variable selection methods fail to identify structure of the types discussed here. Transformed variables with a preferred shape. A variable selection procedure should have the ability to transform an individual variable, if necessary, to induce a symmetric distribution. Symmetry is the preferred shape of an individual variable. For example, the workhorses of statistical measures, the mean and variance, are based on a symmetric distribution. A skewed distribution produces inaccurate estimates for means, variances, and related statistics, such as the correlation coefficient. Symmetry facilitates the interpretation of the variable’s effect in an analysis. Skewed distributions are difficult to examine because most of the observations are bunched together at one end of the distribution. Modeling and analyses based on skewed distributions typically provide a model with doubtful interpretability and questionable results.
3. A variable selection method also should have the ability to straighten relationships. A linear or straight-line relationship is the preferred shape when considering two variables. A straight-line relationship between independent and dependent variables is an assumption of the popular statistical linear regression models (e.g., OLS regression and LRM). (Remember that a linear model is defined as a sum of weighted variables, such as Y = b0 + b1*X1 + b2*X2 + b3*X3.) Moreover, straight-line relationships among all the independent variables are a desirable property. In brief, straight-line relationships are easy to interpret: A unit of increase in one variable produces an expected constant increase in a second variable.
4. Constructed variables from the original variables using simple arithmetic functions. A variable selection method should have the ability to construct simple re-expressions of the original variables. Sum, difference, ratio, or product variables potentially offer more information than the original variables themselves. For example, when analyzing the efficiency of an automobile engine, two important variables are miles traveled and fuel used (gallons). However, we know the ratio variable of miles per gallon is the best variable for assessing the engine’s performance.
5. Constructed variables from the original variables using a set of functions (e.g., trigonometric and/or Boolean functions). A variable selection method should have the ability to construct complex re-expressions with mathematical functions that capture the complex relationships in the data, and thus potentially offer more information than the original variables themselves. In an era of data warehouses and the internet, big data consisting of hundreds of thousands to millions of individual records and hundreds to thousands of variables are commonplace. Relationships among many variables produced by so many individuals are sure to be complex, beyond the simple straight-line pattern. Discovering the mathematical expressions of these relationships, although difficult (though practical guidance exists), should be the hallmark of a high-performance variable selection method. For example, consider the well-known relationship among three variables: The lengths of the three sides of a right triangle. A powerful variable selection procedure would identify the relationship among the sides, even in the presence of measurement error: The longest side (the diagonal) is the square root of the sum of squares of the two shorter sides.
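The right-triangle example can be made concrete with a short Python sketch. It does not show a method discovering the structure; it only illustrates, under assumptions, the payoff of having the constructed variable in the candidate set. NumPy is assumed, the side lengths and measurement-error level are made up for the illustration, and the function names are hypothetical.

```python
import numpy as np

def r_squared(A, y):
    """R-squared of an OLS fit of y on the design matrix A (intercept column included)."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    centered = y - y.mean()
    return 1.0 - float(resid @ resid) / float(centered @ centered)

def compare_fits(n=500, noise_sd=0.05, seed=1):
    """Fit the diagonal on the raw sides, then on the constructed variable sqrt(a^2 + b^2)."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.1, 10.0, size=n)  # first shorter side (made-up range)
    b = rng.uniform(0.1, 10.0, size=n)  # second shorter side (made-up range)
    # Diagonal measured with a small additive error.
    c = np.sqrt(a**2 + b**2) + rng.normal(scale=noise_sd, size=n)
    ones = np.ones(n)
    r2_raw = r_squared(np.column_stack([ones, a, b]), c)
    r2_constructed = r_squared(np.column_stack([ones, np.sqrt(a**2 + b**2)]), c)
    return r2_raw, r2_constructed

if __name__ == "__main__":
    r2_raw, r2_constructed = compare_fits()
    print(f"R-squared with original variables a, b:   {r2_raw:.4f}")
    print(f"R-squared with the constructed variable:  {r2_constructed:.4f}")
```

A selection method restricted to the original variables can only choose among a and b, so it can never reach the fit that the single constructed variable provides.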
In sum, the attribute-weakness implies: A variable selection method should have the ability of generating an enhanced subset of candidate predictor variables.
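One simple ingredient of such an enhanced subset is the symmetrizing transformation from discussion-point 2. As a minimal sketch, assuming NumPy and a made-up right-skewed (lognormal) variable, the Python below compares the sample skewness before and after a log re-expression. The `skewness` and `demo` helpers are written here for illustration and are not taken from any package.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by the 1.5 power of the second."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.mean(d**3) / np.mean(d**2) ** 1.5)

def demo(n=5000, seed=2):
    """Skewness of a right-skewed variable before and after a log re-expression."""
    rng = np.random.default_rng(seed)
    x = rng.lognormal(mean=0.0, sigma=1.0, size=n)  # made-up skewed variable
    return skewness(x), skewness(np.log(x))

if __name__ == "__main__":
    raw, logged = demo()
    print(f"skewness of X:      {raw:.2f}")
    print(f"skewness of log(X): {logged:.2f}")
```

A value near zero after the transformation indicates the roughly symmetric shape that the text describes as preferred.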
V. Alternative Methods are Available

Given the literature on the problematic methods of variable selection in regression, I seek to compile a list, by soliciting the readership of this article, of reliable alternative approaches. The readership’s willingness to share their particular procedure or set of procedures of variable selection will add the much-needed counter-body to the negative entries in the literature of variable selection in regression. The final list will be distributed to all contributors for proofing their individual methodological essay (to include assumptions, rules and steps involved, as well as a description and results of validation of the approach). To have your alternative method(s) be part of the new unproblematic variable-selection literature, please click

VI. Conclusion

Finding the best possible subset of variables to put in a model has been a frustrating exercise. Many variable selection methods exist. Many statisticians know them, but few know that they produce poorly performing models. The wanting variable selection methods are a miscarriage of statistics because they were developed by debasing sound statistical theory into a misguided pseudo-theoretical foundation. I review the five widely used variable selection methods, among which stepwise is arguably the most popular, itemize some of their weaknesses, and answer why they are used. Then, I seek to compile a list, by soliciting the readership of this article, of reliable alternative approaches. The readership’s willingness to share their particular procedure or set of procedures of variable selection will add the much-needed counter-body to the negative entries in the literature of variable selection in regression.
1. Classical underlying assumptions, http://en.wikipedia.org/wiki/Regression_analysis.
1.1 The variable selection methods do not include the new breed of methods that have data mining capability.
2. Tukey, J. W., Exploratory Data Analysis, Addison-Wesley, Reading, MA, 1977.
3. Miller, A. J., Subset Selection in Regression, Chapman and Hall, NY, 1990, pp. iii-x.
3.1 Statistical-Criteria-Supported-by-SAS.pdf (http://www.geniq.net/res/Statistical-Criteria-Supported-by-SAS.pdf)
4. Roecker, Ellen B. 1991. Prediction error and its estimation for subset-selected models. Technometrics 33, 459-468.
5. Copas, J. B. 1983. Regression, prediction and shrinkage (with discussion). Journal of the Royal Statistical Society B 45, 311-354.
6. Henderson, H. V., Velleman, P. F. 1981. Building multiple regression models interactively. Biometrics 37, 391-411.
7. Altman, D. G., Andersen, P. K. 1989. Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine, 771-783.
8. Derksen, S., Keselman, H. J. 1992. Backward, forward and stepwise automated subset selection algorithms. British Journal of Mathematical and Statistical Psychology, 265-282.
9. Judd, C. M., McClelland, G. H. 1989. Data analysis: A model comparison approach. Harcourt Brace Jovanovich, New York.
10. Bernstein, I. H. 1988. Applied Multivariate Analysis, Springer-Verlag, New York.
11. Comment without an attributed citation: Frank Harrell, Vanderbilt University School of Medicine, Department of Biostatistics, Professor of Biostatistics, and Department Chair.
12. Kashid, D. N., Kulkarni, S. R. 2002. A More General Criterion for Subset Selection in Multiple Linear Regression. Communication in Statistics-Theory & Method, 31(5), 795-811.
13. Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society B, Vol. 58, No. 1, 267-288.
14. Ratner, B., 2003. Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data, CRC Press, Boca Raton, Chapter 15, which presents the GenIQ Model (www.GenIQModel.com).
15. Chen, Shyi-Ming, Shie, Jen-Da. 1995. A New Method for Feature Subset Selection for Handling Classification Problems, ISBN: 0-7803-9159-4.
16. SAS Proc Reg Variable Selection Methods.pdf
17. Comment without an attributed citation: In 1996, Tim C. Hesterberg, Research Scientist at Insightful Corporation, asked Brad Efron for the most important problems in statistics, fully expecting the answer to involve the bootstrap, given Efron’s status as inventor. Instead, Efron named a single problem, variable selection in regression. This entails selecting variables from among a set of candidate variables, estimating parameters for those variables, and inference (hypothesis tests, standard errors, and confidence intervals).
18. Other criteria are based on information theory and Bayesian rules.
19. Dash, M., and Liu, H. 1997. Feature Selection for Classification, Intelligent Data Analysis, Elsevier Science Inc.
20. SAS/STAT Manual. See PROC REG and PROC LOGISTIC.
21. R-squared theoretically is not the appropriate measure for a binary dependent variable. However, many analysts use it with varying degrees of success.
21.1 Mark Twain quotation: “Get your facts first, then you can distort them as you please.” http://thinkexist.com/quotes/mark_twain/
22. For example, consider two TS values: 1.934056 and 1.934069. These values are equal when rounding occurs at the third place after the decimal point: 1.934.
22.1 Even if there were a perfect variable selection method, it is unrealistic to believe there is a unique best subset of variables.
23. Comment without an attributed citation: Frank Harrell, Vanderbilt University School of Medicine, Department of Biostatistics, Professor of Biostatistics, and Department Chair.
24. The weights or coefficients (b0, b1, b2 and b3) are derived to satisfy some criterion, such as minimizing the mean squared error used in ordinary least squares regression, or maximizing the joint probability (likelihood) function used in logistic regression.
25. Fox, J., 1997. Applied Regression Analysis, Linear Models, and Related Methods, Sage Publications, California.