Feature Selection Strategies
Selecting the attributes used in your model is a fundamental aspect of the machine learning workflow. It would be great if we could simply use all the information available in our problem without performing any kind of selection. Unfortunately, the very definition of information is the problem: we don’t have information, all we have is data. Noisy, dirty, incomplete data. So the question is: of all the data we have, which variables provide us valuable information and which ones are basically garbage?
A classical article discussing aspects of feature selection was written by Guyon et al. in 2003. With more than 16000 citations, this article is definitely A MUST READ! The Applied Predictive Modeling book is another good source of information about feature selection; the topic is discussed in Chapter 19.
There are basically three categories under which feature selection algorithms can be grouped. This is a well-known classification; however, I borrowed it from this excellent article about feature selection.
Filter-based
In filter based feature selection, we specify some metric and based on that metric we filter features. Two well-known approaches falling under this category are Pearson’s correlation and chi-square.
When using correlation, we simply filter/remove highly correlated features based on the absolute value of their correlation. Here we can either keep only one of the correlated features or combine them all into a single one, e.g. an equally weighted average such as \(new\_feature = \frac{1}{3}(corfeat_1 + corfeat_2 + corfeat_3)\).
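Just to make this concrete, here is a minimal sketch of correlation-based filtering with pandas (the 0.9 threshold and the toy DataFrame are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute Pearson correlation exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Toy example: feat_b is almost a copy of feat_a, so it gets dropped
X = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [1.1, 2.1, 2.9, 4.2, 5.0],
    "feat_c": [5.0, 3.0, 8.0, 1.0, 2.0],
})
print(drop_correlated(X).columns.tolist())  # ['feat_a', 'feat_c']
```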
In the case of chi-square, the idea is to apply a statistical test to select the features that have the strongest relationship with the output variable (i.e., those that are highly dependent on the response). The same idea can be applied to any other statistical test. The scikit-learn library provides a wrapper, SelectKBest, that facilitates the use of ANOVA, chi-squared tests, among others. Notice this is a univariate feature selection approach; in other words, relations between variables are not considered. Consequently, there could be a case where two variables show no dependence with the response when analyzed separately, but when both are present in our data, they turn out to be fundamental features for the problem (see Guyon et al., 2003).
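For example, a minimal SelectKBest sketch with scikit-learn (the iris dataset and k=2 are arbitrary choices; chi2 requires non-negative feature values):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # all features are non-negative, as chi2 requires
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                      # chi-squared statistic for each feature
print(selector.get_support(indices=True))    # indices of the two retained features
print(X_selected.shape)                      # (150, 2)
```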
Another interesting approach for selecting features is based on the application of PCA. Here the idea is to establish the importance of each feature based on the magnitude of the corresponding values in the eigenvectors (higher magnitude, higher importance). Let’s say hypothetically we have 3 PCs (Principal Components) that explain more than 70% of the variability of our data. We can then choose the features with the highest magnitude on the most important PCs. I have my personal concerns about using PCA for feature selection: if only a few PCs do not explain a large fraction of the variability, it is difficult to judge which are the best features on each PC. Anyway, here is a post discussing the idea and a research article using this approach.
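A rough sketch of how that ranking could look with scikit-learn (the wine dataset, 3 components, and ranking by the first PC only are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
feature_names = load_wine().feature_names

# Standardize so no feature dominates the PCs merely because of its scale
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_std)

print(pca.explained_variance_ratio_.sum())   # how much variability the 3 PCs explain

# Rank features by the magnitude of their loading on the first PC
loadings_pc1 = np.abs(pca.components_[0])
ranked = sorted(zip(feature_names, loadings_pc1), key=lambda t: t[1], reverse=True)
for name, loading in ranked[:5]:
    print(f"{name}: {loading:.3f}")
```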
Wrapper-based
These methods treat the selection of a set of features as a search problem. Notice that under these types of strategies, some sort of model/algorithm is required to conduct the search. The idea is to use the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. The classic strategy following this approach is the so-called Recursive Feature Elimination (RFE). The goal of RFE is to select features by recursively considering smaller and smaller sets of features (see Isabelle Guyon’s paper about RFE). A multitude of modern search procedures can be applied to the feature selection problem to reduce the computational requirements of the search. Two of the most common procedures, discussed in the excellent Applied Predictive Modeling book, are Simulated Annealing and Genetic Algorithms.
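A minimal RFE sketch with scikit-learn (the synthetic dataset, the logistic regression estimator, and keeping 5 features are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

# Recursively drop the weakest feature(s) until only 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; higher numbers were eliminated earlier
```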
As mentioned on Wikipedia, the two main disadvantages of these methods are:
- The increasing overfitting risk when the number of observations is insufficient.
- The significant computation time when the number of variables is large.
To deal with the first issue, it is important to follow an appropriate machine learning methodology, using resampling as well as an independent test evaluation (train-CV and test). For the second, some sort of distributed computation will be required.
Embedded-based
Embedded methods use algorithms that have built-in feature selection mechanisms. Two well-known algorithms that include some sort of internal feature selection are [Lasso Regression](https://en.wikipedia.org/wiki/Lasso_(statistics)), a sort of linear regression with a penalization term popularized by Robert Tibshirani in the late 90s, and the good ol’ Random Forests.
In the case of Lasso Regression, the algorithm includes a penalization term that can shrink some coefficient values to zero. As a result, models generated from the lasso consider only the most important variables and are generally much easier to interpret. So we can extract the variables with non-zero coefficients to perform a straightforward feature selection.
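A minimal sketch with scikit-learn (the diabetes dataset and alpha=1.0 are arbitrary choices; in practice you would tune alpha, e.g. with LassoCV):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X = StandardScaler().fit_transform(data.data)   # Lasso is sensitive to feature scale
y = data.target

lasso = Lasso(alpha=1.0).fit(X, y)

# Keep only the variables whose coefficients were not shrunk to zero
selected = np.array(data.feature_names)[lasso.coef_ != 0]
print(selected)
```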
Random Forests (RF) rely on the mean decrease in impurity (MDI, or Gini importance) mechanism. The MDI importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when building the decision trees within the RF. Once the model is fit, we can extract the Gini importance index for all the variables and select the top N for our model.
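A minimal sketch with scikit-learn's RandomForestClassifier (the dataset, number of trees, and top N are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Gini importance (mean decrease in impurity) for every feature
importances = rf.feature_importances_
top_idx = np.argsort(importances)[::-1][:5]   # top 5 features
for i in top_idx:
    print(f"{data.feature_names[i]}: {importances[i]:.3f}")
```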
Extra Bonus
This excellent article (I’ve already mentioned it in other posts) discusses some drawbacks of the RF importance mechanism and mentions an alternative WRAPPER approach for measuring the importance of features not only in RF but in any other machine learning algorithm. The main idea is to randomly permute a predictor column. The importance of that feature is the drop, relative to the baseline, in overall accuracy (or any other metric of interest) caused by permuting the column. The authors of the previous article have extended this idea and proposed a non-parametric approach for feature selection based on semi partial curves.
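Scikit-learn ships this general idea as permutation_importance; here is a minimal sketch (the dataset, model, and n_repeats are arbitrary choices, and the permutation is done on a held-out split rather than the training data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each column on held-out data and record how much the accuracy drops
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.3f}")
```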
Recently, I have found other articles discussing permutation-based importance for feature selection.
This particular article compares both approaches and pointed me to Boruta, an R library for performing feature selection based on permutation (also available for Python).
In this article, the authors discuss more formally the problems and limitations behind the MDI approach.