Gini Impurity [Python]
Gini Impurity is a loss function commonly used in Decision Trees to get the condition for best split of nodes into two child nodes.
1) Theory
The gini impurity is calculated using the below formula:
∑
where p(xi) is the probability of the presence of the class
xi in the dataset X.
The value of G(X) lies between 0 and 1, G(X) ∈ [0, 1).
The lower the value the more homogenous is the dataset i.e. more samples of the same class
are present. Since in decision trees we split the node into two nodes, we calculate
the gini impurity of both splits and take it's weighted average. The feature value with the lowest
value is selected for the split. Let's assume the dataset X is split into two
dataset X1 containing w1 samples and
X2 containing w2 samples, then the gini
impurity is calculated as follows:
2) Implementation
def calculate_prob(points_list, point):
num_points = len(points_list)
count = sum([1 for pt in points_list if pt==point])
return count/num_points
def gini_impurity(left_arr, right_arr):
sum_prob_left = 0
for el in set(left_arr):
sum_prob_left += calculate_prob(left_arr, el)**2
sum_prob_right = 0
for el in set(right_arr):
sum_prob_right += calculate_prob(right_arr, el)**2
return (len(left_arr) * (1 - sum_prob_left) + len(right_arr) * (1 - sum_prob_right))/(len(left_arr) + len(right_arr))