Gini Impurity [Python]

Gini Impurity is a loss function commonly used in Decision Trees to get the condition for best split of nodes into two child nodes.

1) Theory

The gini impurity is calculated using the below formula:

G(X) = 1 - 
N

i=1
p(x
i
)
2

where p(xi) is the probability of the presence of the class xi in the dataset X.

The value of G(X) lies between 0 and 1, G(X) ∈ [0, 1). The lower the value the more homogenous is the dataset i.e. more samples of the same class are present. Since in decision trees we split the node into two nodes, we calculate the gini impurity of both splits and take it's weighted average. The feature value with the lowest value is selected for the split. Let's assume the dataset X is split into two dataset X1 containing w1 samples and X2 containing w2 samples, then the gini impurity is calculated as follows:

G'(X) = 
w1.G(X1) + w2.G(X2)
w1 + w2

2) Implementation

Copied!

def calculate_prob(points_list, point):
    num_points = len(points_list)
    count = sum([1 for pt in points_list if pt==point])
    return count/num_points

def gini_impurity(left_arr, right_arr):
    sum_prob_left = 0
    for el in set(left_arr):
        sum_prob_left += calculate_prob(left_arr, el)**2

    sum_prob_right = 0
    for el in set(right_arr):
        sum_prob_right += calculate_prob(right_arr, el)**2

    return (len(left_arr) * (1 - sum_prob_left) + len(right_arr) * (1 - sum_prob_right))/(len(left_arr) + len(right_arr))