Gini Impurity - Explanation and Implementation [Python]

Gini Impurity [Python]

Gini Impurity is a loss function commonly used in Decision Trees to get the condition for best split of nodes into two child nodes.

1) Theory

The gini impurity is calculated using the below formula:

G(X) = 1 -

∑

i=1

p(x

)

where p(x_i) is the probability of the presence of the class x_i in the dataset X.

The value of G(X) lies between 0 and 1, G(X) ∈ [0, 1). The lower the value the more homogenous is the dataset i.e. more samples of the same class are present. Since in decision trees we split the node into two nodes, we calculate the gini impurity of both splits and take it's weighted average. The feature value with the lowest value is selected for the split. Let's assume the dataset X is split into two dataset X₁ containing w₁ samples and X₂ containing w₂ samples, then the gini impurity is calculated as follows:

G'(X) =

w₁.G(X₁) + w₂.G(X₂)

w₁ + w₂

2) Implementation

Copied!

def calculate_prob(points_list, point):
    num_points = len(points_list)
    count = sum([1 for pt in points_list if pt==point])
    return count/num_points

def gini_impurity(left_arr, right_arr):
    sum_prob_left = 0
    for el in set(left_arr):
        sum_prob_left += calculate_prob(left_arr, el)**2

    sum_prob_right = 0
    for el in set(right_arr):
        sum_prob_right += calculate_prob(right_arr, el)**2

    return (len(left_arr) * (1 - sum_prob_left) + len(right_arr) * (1 - sum_prob_right))/(len(left_arr) + len(right_arr))

Oblivion Element

Blogs

Tutorials

Learning

Gini Impurity [Python]

1) Theory

2) Implementation