Computer Science Notes

CS Notes is a simple blog for keeping track of CS-related material I consider useful.

06 Aug 2020

Feature Selection Resources

by Harpo Maxx

A common method for feature selection is to analyze the coefficients of a linear regression; another is to use Random Forest variable importance. Recently, as part of the DREAM Tumor Deconvolution challenge, our team applied Random Forest (Breiman 2001) for feature selection. Despite the simplicity of the variable importance function based on the Gini index, there are some caveats. The fact is that Gini-based feature importance is biased toward continuous and high-cardinality categorical variables (unless you apply some sort of normalization).
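As a quick illustration of that bias, here is a minimal sketch, assuming scikit-learn and purely synthetic data: a random, high-cardinality column with no real signal can still pick up noticeable Gini-based importance.

```python
# Sketch: Gini (impurity-based) importances and their cardinality bias.
# All data below is synthetic; feature names are made up for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=n)                      # truly predictive feature
noise_id = rng.integers(0, n, size=n).astype(float)   # high-cardinality pure noise
y = (informative + 0.5 * rng.normal(size=n) > 0).astype(int)

X = np.column_stack([informative, noise_id])
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The noise column carries no signal, yet Gini-based importance
# often assigns it a non-trivial share because it offers many split points.
for name, imp in zip(["informative", "noise_id"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```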

Here are some articles and links on the topic:

  1. A simple Medium post explaining the Gini index and information entropy, two well-known metrics used in CART and Random Forest algorithms for feature selection.

  2. An excellent article describing, with examples, the problem with the Gini index and the need for the permutation importance measure, a more resource-consuming measure already described by Breiman (see the sketch after this list). A MUST READ. More information here and inside the article.
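For reference, a minimal sketch of Breiman-style permutation importance using scikit-learn's `permutation_importance`: the importance of a feature is the drop in held-out score when its values are shuffled, which avoids the cardinality bias at the cost of extra computation. The dataset and parameter choices below are just illustrative assumptions.

```python
# Sketch: permutation importance on a held-out set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# n_repeats controls how many shuffles per feature: more repeats give
# more stable estimates but cost more compute (the trade-off Breiman noted).
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
print("top features:", top, result.importances_mean[top])
```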

Also related (at some point) to feature selection in linear regression, the article by Jan Vanhove on why collinearity isn't a disease that needs curing is another piece worth reading. In it, Vanhove explains how collinearity can affect the estimation of regression coefficients, when it is not actually a problem, and, in the cases where it is, the possible ways of dealing with it.
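A rough simulated sketch of the estimation issue, using only NumPy and synthetic data: with two near-duplicate predictors, the individual coefficients swing wildly across replications, while the jointly identified quantity (their sum) stays stable, which is why predictions can be fine even when coefficients look unstable.

```python
# Sketch: collinearity inflates the variance of individual coefficients.
import numpy as np

rng = np.random.default_rng(1)
coefs = []
for _ in range(500):
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)     # near-duplicate of x1
    y = 1.0 * x1 + 0.0 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    coefs.append(beta[1:])

coefs = np.array(coefs)
# Individual coefficients vary wildly across replications...
print("sd of b1, b2:", coefs.std(axis=0))
# ...but their sum, the quantity the data actually pins down, is stable.
print("sd of b1 + b2:", (coefs[:, 0] + coefs[:, 1]).std())
```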

Finally, a post about using PCA to select the most important features for each principal component.
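A minimal sketch of that idea, assuming scikit-learn and its breast-cancer toy dataset: rank features by the absolute value of their loadings within each principal component.

```python
# Sketch: top features per principal component, ranked by |loading|.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)   # PCA is scale-sensitive

pca = PCA(n_components=3).fit(X)
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:3]   # largest |loading| first
    names = [data.feature_names[j] for j in top]
    print(f"PC{i + 1} ({pca.explained_variance_ratio_[i]:.1%} var): {names}")
```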