Random Forest from scratch [Python]
Random Forest is an ensemble model, which means it uses a collection of predictors to predict a class (classification) or a value (regression). The predictors used here are Decision Trees. The reasoning behind this kind of modelling is that several weak learners (individual Decision Trees) can be combined to produce a strong learner. These learners should be diverse in terms of training data and training variables so that the ensemble learns the patterns in the data while avoiding the overfitting that a single Decision Tree suffers from. To learn how to implement a Decision Tree, refer to this blog.
1) Architecture
Figure: The red and blue balls represent the two classes for a binary classification problem.
2) Weak Learners
We know that a Decision Tree is able to completely learn (fit) the training data but suffers from overfitting when evaluated on the test data. To turn it into a weak learner we shall:
(i) Train each DT on a small subset of the training data. We will randomly pick a few data points with replacement from the training set for each DT; this is called Bagging (a minimal sketch of such a bootstrap sample follows this list). It leads to diverse DTs, with some data points shared between trees and some exclusive to one.
(ii) Reduce the maximum depth of the DTs. This pushes each tree towards underfitting, which is exactly what we want from a weak learner.
(iii) Consider only a subset of the total features when making a decision (node split) in the DT. For example, if we have 16 features, we would consider only, say, 4 random features when searching for the best split while training the DT. This also weakens the individual DT.
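As a quick illustration of point (i), a bootstrap sample simply draws row indices with replacement, so some rows repeat while others are left out entirely, which is what makes the trees diverse. The toy array below is purely illustrative and not part of the actual implementation.

import numpy as np

# hypothetical toy training set: 10 rows, one feature
toy_X = np.arange(10).reshape(-1, 1)

# draw a small bootstrap sample of 5 rows, with replacement
sample_idx = np.random.choice(len(toy_X), size=5, replace=True)
print(sample_idx)                                 # e.g. [3 3 7 0 9] -- duplicates can appear
print(sorted(set(range(10)) - set(sample_idx)))   # rows this tree never sees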
3) Modifying previous functions
We'll take up a classification problem in this blog. To incorporate points (ii) and (iii) we make the following modifications to the decision_tree_classifier function:
import numpy as np

def decision_tree_classifier(X, y, model, max_depth=np.nan, max_features=None):
    # stop when the node is pure or the depth budget is exhausted (np.nan means unlimited depth)
    if gini_impurity(y, [])==0.0 or not max_depth:
        # a metric value of 0 indicates a leaf node with only one class present
        model.metric_value = gini_impurity(y, [])
        values, counts = np.unique(y, return_counts=True)
        model.output = values[np.argmax(counts)]
        return model
    features = np.arange(0, X.shape[-1])
    if max_features is not None:
        # consider only a random subset of the features for this split
        np.random.shuffle(features)
        features = features[:max_features]
    min_loss = 1
    min_list = []
    for feature in features:
        for point in set(X[:, feature]):
            # loss of the left and right split created by this feature value
            loss = gini_impurity(y[X[:, feature]<=point], y[X[:, feature]>point])
            if loss<=min_loss:
                if loss!=min_loss:
                    min_list.clear()
                # store every (value, feature) pair sharing the least loss seen so far
                min_list.append((point, feature))
                min_loss = loss
    # choose one of the best splits at random
    np.random.shuffle(min_list)
    selected_point, selected_feature_index = min_list[0]
    # place the threshold midway between the selected value and the next larger one
    lower = selected_point
    upper = max(X[:, selected_feature_index])
    for el in set(X[:, selected_feature_index]):
        if lower<el<upper:
            upper = el
    selected_point = (lower+upper)/2
    # create the left and right splits using the condition learned
    left_mask, right_mask = X[:, selected_feature_index]<=selected_point, X[:, selected_feature_index]>selected_point
    if sum(left_mask):
        model.feature_index = selected_feature_index
        model.feature_value = selected_point
        # we store the gini index of the whole node
        model.metric_value = gini_impurity(y, [])
        # recursive call, passing down the reduced depth budget and the feature limit
        model.left = decision_tree_classifier(X[left_mask], y[left_mask], Node(), max_depth-1, max_features)
        # store the class with the most data points as the output of the node
        values, counts = np.unique(y, return_counts=True)
        model.output = values[np.argmax(counts)]
    if sum(right_mask):
        model.feature_index = selected_feature_index
        model.feature_value = selected_point
        # we store the gini index of the whole node
        model.metric_value = gini_impurity(y, [])
        # recursive call, passing down the reduced depth budget and the feature limit
        model.right = decision_tree_classifier(X[right_mask], y[right_mask], Node(), max_depth-1, max_features)
        # store the class with the most data points as the output of the node
        values, counts = np.unique(y, return_counts=True)
        model.output = values[np.argmax(counts)]
    return model
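The function above relies on the gini_impurity helper and the Node class from the Decision Tree blog. Their exact definitions live in that blog; the sketch below only illustrates the interface assumed here (a weighted Gini impurity over a candidate left/right split, and a Node whose take_decision method walks the tree), so treat it as an assumption rather than the original code.

import numpy as np

def gini_impurity(left_y, right_y):
    # weighted Gini impurity of a candidate split;
    # an empty right side simply gives the impurity of the node itself
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p**2)
    n = len(left_y) + len(right_y)
    return (len(left_y)*gini(left_y) + len(right_y)*gini(right_y)) / n

class Node:
    def __init__(self):
        self.feature_index = None   # feature used for the split at this node
        self.feature_value = None   # threshold learned for that feature
        self.metric_value = None    # gini impurity of the node
        self.output = None          # majority class at this node
        self.left = None
        self.right = None

    def take_decision(self, x):
        # walk down the tree until no further split applies
        if self.feature_index is None:
            return self.output
        if x[self.feature_index] <= self.feature_value:
            if self.left is not None:
                return self.left.take_decision(x)
        elif self.right is not None:
            return self.right.take_decision(x)
        return self.output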
4) Dataset
The dataset that we will use contains 14980 data points with 14 features. We will split it into train and test sets in an 80:20 ratio. The information about the dataset and the dataset itself can be downloaded from here.
import pandas as pd
data = pd.read_csv('EEG_Eye_State_Classification.csv').dropna(how='any')
X_data = data.drop(columns='eyeDetection').to_numpy()
Y_data = data['eyeDetection'].to_numpy()
# You can use sklearn's train_test_split for better splits
idx = len(X_data)*80//100
X_train, X_test, Y_train, Y_test = X_data[:idx], X_data[idx:], Y_data[:idx], Y_data[idx:]
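As the comment above suggests, sklearn's train_test_split can give a shuffled split instead of the plain slice; a quick sketch (the random_state value is arbitrary):

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X_data, Y_data, test_size=0.2, random_state=42)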
5) Implementation
Now we incorporate point (i). The parameters we will use are num_trees=100 (the number of trees), max_depth=5 and max_features=4, and each tree will be trained on 200 data points sampled with replacement from the training set.
num_trees = 100
num_row = 200
models = []
weights = []  # per-tree weights could be stored here; equal weighting is used in the evaluation below
for _ in range(num_trees):
    # Bagging: draw num_row row indices with replacement (a bootstrap sample)
    row_idx = np.random.choice(X_train.shape[0], size=num_row, replace=True)
    model = Node()
    decision_tree_classifier(X_train[row_idx], Y_train[row_idx], model, max_depth=5, max_features=4)
    models.append(model)
6) Evaluation
We now have a collection of 100 Decision Trees, so we can do the voting. Let's consider the simple case where every DT has the same weight: we choose the output that the majority of DTs agree on.
from sklearn.metrics import accuracy_score
y_pred = []
for el in X_test:
    # keys 0 and 1 are the two classes and the values count the votes received
    tmp_pred = {0: 0, 1: 0}
    for model in models:
        tmp_pred[model.take_decision(el)] += 1
    # choose the class with the majority of votes (the dict keys are in the order 0, 1)
    y_pred.append(np.argmax(list(tmp_pred.values())))
Y_pred = np.array(y_pred)
print(f'Accuracy:\n{accuracy_score(Y_test, Y_pred)}')
7) Conclusion
If you compare the accuracy of a single Decision Tree and a Random Forest built with the same parameters, you should see that the Random Forest generalizes noticeably better on the test set. This implementation is meant only to give an understanding of the Random Forest algorithm; the techniques used in this blog can be optimized further to achieve even better accuracy.
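One quick way to check this yourself is to build a single Decision Tree the same way a forest member is built (same sample size, depth limit and feature limit as above) and score it on the same test set; the variable names below are just for illustration.

# a single Decision Tree built with the same parameters as one forest member
single_tree = Node()
sample = np.random.choice(X_train.shape[0], size=num_row, replace=True)
decision_tree_classifier(X_train[sample], Y_train[sample], single_tree, max_depth=5, max_features=4)

dt_pred = np.array([single_tree.take_decision(el) for el in X_test])
print(f'Single Decision Tree accuracy: {accuracy_score(Y_test, dt_pred)}')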