
Print "Scores for X0, X1, X2:", map(lambda x:round (x,3), Rf = RandomForestRegressor(n_estimators=20, max_features=2) In the following example, we have three correlated variables \(X_0, X_1, X_2\), and no noise in the data, with the output variable simply being the sum of the three features: The effect of this phenomenon is somewhat reduced thanks to random selection of features at each node creation, but in general the effect is not removed completely. But when interpreting the data, it can lead to the incorrect conclusion that one of the variables is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the response variable. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. As a consequence, they will have a lower reported importance. But once one of them is used, the importance of others is significantly reduced since effectively the impurity they can remove is already removed by the first feature. Secondly, when the dataset has two (or more) correlated features, then from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. Firstly, feature selection based on impurity reduction is biased towards preferring variables with more categories (see Bias in random forest variable importance measures). There are a few things to keep in mind when using the impurity based ranking. Print sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), #Load boston housing dataset as an example This is the feature importance measure exposed in sklearn’s Random Forest implementations ( random forest classifier and random forest regressor).įrom sklearn.ensemble import RandomForestRegressor For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For classification, it is typically either Gini impurity or information gain/entropy and for regression trees it is variance. The measure based on which the (locally) optimal condition is chosen is called impurity. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. Random forest consists of a number of decision trees. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy. Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness and ease of use. In this post, I’ll discuss random forests, another popular approach for feature ranking. In my previous posts, I looked at univariate feature selection and linear models and regularization for feature selection.
