Random Forest from scratch [Python]
Random Forest is an ensemble model, which means it uses a collection of predictors to predict a class (classification) or a value (regression). The predictors used here are Decision Trees. The reasoning behind this kind of modelling is that several weak learners (individual Decision Trees) can be combined to produce a strong learner. These learners should be diverse in terms of training data and training variables so that the ensemble learns the patterns in the data while avoiding the overfitting that a single Decision Tree suffers from. To learn how to implement a Decision Tree, refer to this blog.
1) Architecture
Figure: The red and blue balls represent the two classes for a binary classification problem.
2) Weak Learners
We know that a Decision Tree is able to completely learn (fit) the training data but suffers from overfitting when evaluated on the test data. To turn it into a weak learner we shall:
(i) Train each DT on a small subset of the training data. We will randomly pick a few data points with replacement from the training set for each DT; this is called Bagging (a minimal sketch of such a bootstrap sample follows this list). It leads to diverse DTs, with some data points shared between trees and some exclusive to one.
(ii) Reduce the maximum depth of the DTs. This pushes each tree towards underfitting, which is exactly what we want from a weak learner.
(iii) Consider only a subset of the total features when making a decision (node split) in the DT. For example, if we have 16 features, we would consider only, say, 4 random features when searching for the best split while training the DT. This also weakens the individual DT.
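As a quick illustration of point (i), a bootstrap sample simply draws row indices with replacement, so some rows repeat while others are left out entirely, which is what makes the trees diverse. The toy array below is purely illustrative and not part of the actual implementation.

import numpy as np

# hypothetical toy training set: 10 rows, one feature
toy_X = np.arange(10).reshape(-1, 1)

# draw a small bootstrap sample of 5 rows, with replacement
sample_idx = np.random.choice(len(toy_X), size=5, replace=True)
print(sample_idx)                                 # e.g. [3 3 7 0 9] -- duplicates can appear
print(sorted(set(range(10)) - set(sample_idx)))   # rows this tree never sees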
3) Modifying previous functions
We'll take up a classification problem in this blog. To incorporate points (ii) and (iii) we make the following modifications to the decision_tree_classifier function:
import numpy as np

def decision_tree_classifier(X, y, model, max_depth=np.nan, max_features=None):
    # stop when the node is pure or the depth budget is exhausted (np.nan means unlimited depth)
    if gini_impurity(y, [])==0.0 or not max_depth:
        # a metric value of 0 indicates a leaf node with only one class present
        model.metric_value = gini_impurity(y, [])
        values, counts = np.unique(y, return_counts=True)
        model.output = values[np.argmax(counts)]
        return model
    features = np.arange(0, X.shape[-1])
    if max_features is not None:
        # consider only a random subset of the features for this split
        np.random.shuffle(features)
        features = features[:max_features]
    min_loss = 1
    min_list = []
    for feature in features:
        for point in set(X[:, feature]):
            # loss of the left and right split created by this feature value
            loss = gini_impurity(y[X[:, feature]<=point], y[X[:, feature]>point])
            if loss<=min_loss:
                if loss!=min_loss:
                    min_list.clear()
                # store every (value, feature) pair sharing the least loss seen so far
                min_list.append((point, feature))
                min_loss = loss
    # choose one of the best splits at random
    np.random.shuffle(min_list)
    selected_point, selected_feature_index = min_list[0]
    # place the threshold midway between the selected value and the next larger one
    lower = selected_point
    upper = max(X[:, selected_feature_index])
    for el in set(X[:, selected_feature_index]):
        if lower<el<upper:
            upper = el
    selected_point = (lower+upper)/2
    # create the left and right splits using the condition learned
    left_mask, right_mask = X[:, selected_feature_index]<=selected_point, X[:, selected_feature_index]>selected_point
    if sum(left_mask):
        model.feature_index = selected_feature_index
        model.feature_value = selected_point
        # we store the gini index of the whole node
        model.metric_value = gini_impurity(y, [])
        # recursive call, passing down the reduced depth budget and the feature limit
        model.left = decision_tree_classifier(X[left_mask], y[left_mask], Node(), max_depth-1, max_features)
        # store the class with the most data points as the output of the node
        values, counts = np.unique(y, return_counts=True)
        model.output = values[np.argmax(counts)]
    if sum(right_mask):
        model.feature_index = selected_feature_index
        model.feature_value = selected_point
        # we store the gini index of the whole node
        model.metric_value = gini_impurity(y, [])
        # recursive call, passing down the reduced depth budget and the feature limit
        model.right = decision_tree_classifier(X[right_mask], y[right_mask], Node(), max_depth-1, max_features)
        # store the class with the most data points as the output of the node
        values, counts = np.unique(y, return_counts=True)
        model.output = values[np.argmax(counts)]
    return model
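The function above relies on the gini_impurity helper and the Node class from the Decision Tree blog. Their exact definitions live in that blog; the sketch below only illustrates the interface assumed here (a weighted Gini impurity over a candidate left/right split, and a Node whose take_decision method walks the tree), so treat it as an assumption rather than the original code.

import numpy as np

def gini_impurity(left_y, right_y):
    # weighted Gini impurity of a candidate split;
    # an empty right side simply gives the impurity of the node itself
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p**2)
    n = len(left_y) + len(right_y)
    return (len(left_y)*gini(left_y) + len(right_y)*gini(right_y)) / n

class Node:
    def __init__(self):
        self.feature_index = None   # feature used for the split at this node
        self.feature_value = None   # threshold learned for that feature
        self.metric_value = None    # gini impurity of the node
        self.output = None          # majority class at this node
        self.left = None
        self.right = None

    def take_decision(self, x):
        # walk down the tree until no further split applies
        if self.feature_index is None:
            return self.output
        if x[self.feature_index] <= self.feature_value:
            if self.left is not None:
                return self.left.take_decision(x)
        elif self.right is not None:
            return self.right.take_decision(x)
        return self.output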
4) Dataset
The dataset that we will use contains 14980 data points with 14 features. We will split it into train and test sets in an 80:20 ratio. The information about the dataset and the dataset itself can be downloaded from here.
import pandas as pd
data = pd.read_csv('EEG_Eye_State_Classification.csv').dropna(how='any')
X_data = data.drop(columns='eyeDetection').to_numpy()
Y_data = data['eyeDetection'].to_numpy()
# You can use sklearn's train_test_split for better splits
idx = len(X_data)*80//100
X_train, X_test, Y_train, Y_test = X_data[:idx], X_data[idx:], Y_data[:idx], Y_data[idx:]
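As the comment above suggests, sklearn's train_test_split can give a shuffled split instead of the plain slice; a quick sketch (the random_state value is arbitrary):

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X_data, Y_data, test_size=0.2, random_state=42)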
5) Implementation
Now we incorporate point (i). The parameters we will use are num_trees=100 (the number of trees), max_depth=5 and max_features=4, and each tree will be trained on 200 data points sampled with replacement from the training set.
num_trees = 100
num_row = 200
models = []
weights = []  # per-tree weights could be stored here; equal weighting is used in the evaluation below
for _ in range(num_trees):
    # Bagging: draw num_row row indices with replacement (a bootstrap sample)
    row_idx = np.random.choice(X_train.shape[0], size=num_row, replace=True)
    model = Node()
    decision_tree_classifier(X_train[row_idx], Y_train[row_idx], model, max_depth=5, max_features=4)
    models.append(model)
6) Evaluation
We now have a collection of 100 Decision Trees, so we can do the voting. Let's consider the simple case where every DT has the same weight: we choose the output that the majority of DTs agree on.
from sklearn.metrics import accuracy_score
y_pred = []
for el in X_test:
    # keys 0 and 1 are the two classes and the values count the votes received
    tmp_pred = {0: 0, 1: 0}
    for model in models:
        tmp_pred[model.take_decision(el)] += 1
    # choose the class with the majority of votes (the dict keys are in the order 0, 1)
    y_pred.append(np.argmax(list(tmp_pred.values())))
Y_pred = np.array(y_pred)
print(f'Accuracy:\n{accuracy_score(Y_test, Y_pred)}')
7) Conclusion
If you compare the accuracy of a single Decision Tree and a Random Forest built with the same parameters, you should see that the Random Forest generalizes noticeably better on the test set. This implementation is meant only to give an understanding of the Random Forest algorithm; the techniques used in this blog can be optimized further to achieve even better accuracy.
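One quick way to check this yourself is to build a single Decision Tree the same way a forest member is built (same sample size, depth limit and feature limit as above) and score it on the same test set; the variable names below are just for illustration.

# a single Decision Tree built with the same parameters as one forest member
single_tree = Node()
sample = np.random.choice(X_train.shape[0], size=num_row, replace=True)
decision_tree_classifier(X_train[sample], Y_train[sample], single_tree, max_depth=5, max_features=4)

dt_pred = np.array([single_tree.take_decision(el) for el in X_test])
print(f'Single Decision Tree accuracy: {accuracy_score(Y_test, dt_pred)}')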