Compute Gini Inequality Coefficient Given Data Vector (One Variable)

View a vector of positive values as a population and calculate a discrete GINI inequality measure. This file works out how the ff_dist_gini_vector_pos function from Fan’s REconTools Package works. See also fs_gini_disc from R4Econ, and the ff_dist_gini_random_var function for the GINI of discrete random variables.

There is a vector of values (all positive). This could be height information for N individuals, or income information for N individuals. Calculate the GINI coefficient treating the given vector as the population. This is not an estimation exercise where we want to estimate the population gini based on a sample: the given array is the population. The population is discrete and consists only of the N individuals in the length-N vector.

See the formula below. Note that for each \(N\) the formula puts an upper bound on measured inequality, and this bound is well below 1 when \(N\) is small. So for small \(N\), we cannot really compare inequality across arrays with different \(N\); we can only compare arrays with the same \(N\).

Formula

Given a monotonically increasing array \(X\) with elements \(x_1,...,x_N\):

There is a box, width = 1, height = 1
The width is discretized into \(N\) individuals, so each individual’s width is \(\frac{1}{N}\)
The height is normalized to 1: for the \(N\)th individual, the total height is the sum of all \(x\), so all bars need to be rescaled by \(\sum_{i=1}^{N} x_i\)

The GINI formula used here is: \[ GINI = 1 - \frac{2}{N+1} \cdot \left(\sum_{i=1}^N \sum_{j=1}^{i} x_j\right) \cdot \left( \sum_{i=1}^N x_i \right)^{-1} \]
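As a quick check of the formula, here is a minimal sketch in R that applies it by hand to the two-element vector (1, 2). The variable names are illustrative only.

```r
# Worked example for x = (1, 2), already sorted in ascending order
ar_x <- c(1, 2)
it_n <- length(ar_x)        # N = 2
ar_cumsum <- cumsum(ar_x)   # inner sums for i = 1, 2: 1 and 3
# 1 - 2/(N+1) * (sum of cumulative sums) / (sum of x)
fl_gini <- 1 - (2 / (it_n + 1)) * sum(ar_cumsum) / sum(ar_x)
print(fl_gini)              # 1 - (2/3) * (4/3) = 1/9
```

The value 1/9 is about 0.1111, which is the number reported for ar_ineql_alittle_n2 in the testing section.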

Derive the formula in the steps below.

Step 1 Area Formula

\[ \Gamma = \sum_{i=1}^N \frac{1}{N} \cdot \left( \sum_{j=1}^{i} \left( \frac{x_j}{\sum_{\widehat{j}=1}^N x_{\widehat{j}} } \right) \right) \]
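The area can be read as \(N\) bars, each of width \(\frac{1}{N}\), whose heights are cumulative shares of the total. A minimal sketch with an illustrative vector (the data and names are made up for this example):

```r
# Illustrative data, sorted ascending as the derivation assumes
ar_x <- sort(c(3, 1, 2, 10))
it_n <- length(ar_x)
# Height of bar i: share of the total held by individuals 1..i
ar_height <- cumsum(ar_x) / sum(ar_x)
# Each bar has width 1/N; Gamma is the total area of the bars
fl_gamma <- sum((1 / it_n) * ar_height)
print(ar_height)   # 0.0625 0.1875 0.3750 1.0000
print(fl_gamma)    # 0.40625
```

The last bar always has height 1, which is the normalization of the box height described above.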

Step 2 Total Area Given Perfect equality

With perfect equality, \(x_i=a\) for all \(i\). The area in this case is the benchmark that we need to divide by.

\[ \Gamma^{\text{equal}} = \sum_{i=1}^N \frac{1}{N} \cdot \left( \sum_{j=1}^{i} \left( \frac{a}{\sum_{\widehat{j}=1}^N a } \right) \right) = \frac{N+1}{N}\cdot\frac{1}{2} \]

As the number of elements of the vector increases: \[ \lim_{N \rightarrow \infty}\Gamma^{\text{equal}} = \lim_{N \rightarrow \infty} \frac{N+1}{N}\cdot\frac{1}{2} = \frac{1}{2} \]
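The closed form for the perfect-equality area can be checked numerically. The sketch below compares the bar-sum definition with \(\frac{N+1}{N}\cdot\frac{1}{2}\) for a few values of \(N\); the helper name ffi_gamma_equal is hypothetical and exists only for this check.

```r
# Area under perfect equality, computed directly from the bar-sum definition
ffi_gamma_equal <- function(it_n, fl_a = 1) {
  ar_x <- rep(fl_a, it_n)   # every individual holds the same amount a
  sum((1 / it_n) * cumsum(ar_x) / sum(ar_x))
}
ar_n <- c(1, 2, 10, 100, 10000)
ar_direct <- sapply(ar_n, ffi_gamma_equal)
ar_closed <- (ar_n + 1) / (2 * ar_n)
# The two columns should agree, and both approach 1/2 as N grows
print(cbind(ar_n, ar_direct, ar_closed))
```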

Step 3 Arriving at Finite Vector Gini Formula

Given what we have from above, we obtain the gini formula by dividing the area \(\Gamma\) by the total area under perfect equality (the area below the 45 degree line).

\[ GINI = 1 - \left(\sum_{i=1}^N \sum_{j=1}^{i} x_j\right) \cdot \left( N \cdot \sum_{i=1}^N x_i \right)^{-1} \cdot \left( \frac{N+1}{N}\cdot\frac{1}{2} \right)^{-1} = 1 - \frac{2}{N+1} \cdot \left(\sum_{i=1}^N \sum_{j=1}^{i} x_j\right) \cdot \left( \sum_{i=1}^N x_i \right)^{-1} \]
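To see that the two expressions agree, the sketch below computes the gini once as one minus the ratio of the two areas and once with the simplified closed form. The vector is the same as ar_ineql_some_n10 in the testing section; the other names are illustrative.

```r
ar_x <- sort(c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
it_n <- length(ar_x)
# Area under the bars (Step 1) and area under perfect equality (Step 2)
fl_gamma <- sum(cumsum(ar_x)) / (it_n * sum(ar_x))
fl_gamma_equal <- (it_n + 1) / (2 * it_n)
# Gini as one minus the ratio of the two areas
fl_gini_ratio <- 1 - fl_gamma / fl_gamma_equal
# Gini from the simplified closed form
fl_gini_closed <- 1 - (2 / (it_n + 1)) * sum(cumsum(ar_x)) / sum(ar_x)
print(c(fl_gini_ratio, fl_gini_closed))
```

Both values equal \(1 - \frac{2}{11}\cdot\frac{585}{231} \approx 0.5396\).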

Step 4 Maximum Inequality given N

Suppose \(x_i=0\) for all \(i<N\), then:

\[ GINI^{x_i = 0 \text{ except } i=N} = 1 - \frac{2}{N+1} \cdot x_N \cdot \left( x_N \right)^{-1} = 1 - \frac{2}{N+1} \]

\[ \lim_{N \rightarrow \infty} GINI^{x_i = 0 \text{ except } i=N} = 1 - \lim_{N \rightarrow \infty} \frac{2}{N+1} = 1 \]

Note that for small N, for example if \(N=10\), even when one person holds all income and all others have 0 income, the formula will not produce a gini of one; the gini is instead equal to \(1-\frac{2}{11}=\frac{9}{11}\approx 0.8182\). If \(N=2\), inequality is at most \(1-\frac{2}{3}=\frac{1}{3}\approx 0.333\).

\[ MostUnequalGINI\left(N\right) = 1 - \frac{2}{N+1} = \frac{N-1}{N+1} \]
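The bound \(\frac{N-1}{N+1}\) can be tabulated for a few values of \(N\) and compared with a direct calculation in which one individual holds everything. This is a minimal sketch; ffi_gini is a hypothetical helper that simply restates the formula above.

```r
# Hypothetical helper: the finite-vector gini formula
ffi_gini <- function(ar_x) {
  ar_x <- sort(ar_x)
  it_n <- length(ar_x)
  1 - (2 / (it_n + 1)) * sum(cumsum(ar_x)) / sum(ar_x)
}
ar_n <- c(1, 2, 10, 30, 100, 1000)
# Closed-form upper bound on the gini given N
ar_max_gini <- (ar_n - 1) / (ar_n + 1)
# Direct calculation: N - 1 zeros and a single positive value
ar_direct <- sapply(ar_n, function(it_n) ffi_gini(c(rep(0, it_n - 1), 1)))
print(cbind(ar_n, ar_max_gini, ar_direct))
```

For \(N=2\) the bound is \(\frac{1}{3}\) and for \(N=10\) it is \(\frac{9}{11}\); it only gets close to 1 for large \(N\).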

Implement GINI Formula in R

The GINI formula just derived is simple to compute. It needs four pieces:

scalar: \(\frac{2}{N+1}\)
cumsum: \(\sum_{j=1}^{i} x_j\)
sum of cumsum: \(\left(\sum_{i=1}^N \sum_{j=1}^{i} x_j\right)\)
sum: \(\sum_{i=1}^N x_i\)

There are no package dependencies. This is the ff_dist_gini_vector_pos function. Define the formula here:

# Clear the workspace; no packages need to be loaded
rm(list = ls())
# Directly implement the GINI formula derived in Step 3 above
fv_dist_gini_vector_pos_test <- function(ar_pos) {
  # Check length and give a warning when N is small
  it_n <- length(ar_pos)
  if (it_n <= 100) warning('Data vector has n=', it_n, ', max-inequality/max-gini=', (it_n - 1) / (it_n + 1))
  # Sort in ascending order, as the formula assumes
  ar_pos <- sort(ar_pos)
  # Implement the formula: 1 - 2/(N+1) * sum(cumsum(x)) / sum(x)
  fl_gini <- 1 - ((2 / (it_n + 1)) * sum(cumsum(ar_pos)) * (sum(ar_pos))^(-1))
  return(fl_gini)
}

Testing

Generate a number of example arrays for testing:

# Example Arrays of data
ar_equal_n1 = c(1)
ar_ineql_n1 = c(100)

ar_equal_n2 = c(1,1)
ar_ineql_alittle_n2 = c(1,2)
ar_ineql_somewht_n2 = c(1,2^3)
ar_ineql_alotine_n2 = c(1,2^5)
ar_ineql_veryvry_n2 = c(1,2^8)
ar_ineql_mostmst_n2 = c(1,2^13)

ar_equal_n10 = c(2,2,2,2,2,2, 2, 2, 2, 2)
ar_ineql_some_n10 = c(1,2,3,5,8,13,21,34,55,89)
ar_ineql_very_n10 = c(1,2^2,3^2,5^2,8^2,13^2,21^2,34^2,55^2,89^2)
ar_ineql_extr_n10 = c(1,2^2,3^3,5^4,8^5,13^6,21^7,34^8,55^9,89^10)

# Uniform draw testing
ar_unif_n1000 = runif(1000, min=0, max=1)

# Normal draw testing
ar_norm_lowsd_n1000 = rnorm(1000, mean=100, sd =1)
ar_norm_lowsd_n1000[ar_norm_lowsd_n1000<0] = 0
ar_norm_highsd_n1000 = rnorm(1000, mean=100, sd =20)
ar_norm_highsd_n1000[ar_norm_highsd_n1000<0] = 0

# Beta draw testing
ar_beta_mostrich_n1000 = rbeta(1000, 5, 1)
ar_beta_mostpoor_n1000 = rbeta(1000, 1, 5)
ar_beta_manyrichmanypoor_nomiddle_n1000 = rbeta(1000, 0.5, 0.5)

Now test the example arrays above using the function based on our formula:

#> 
#> Small N=1 Hard-Code
#> Warning in fv_dist_gini_vector_pos_test(ar_equal_n1): Data vector has n=1, max-
#> inequality/max-gini=0
#> ar_equal_n1: 0
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_n1): Data vector has n=1, max-
#> inequality/max-gini=0
#> ar_ineql_n1: 0
#> 
#> Small N=2 Hard-Code, converge to 1/3, see formula above
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_alittle_n2): Data vector has
#> n=2, max-inequality/max-gini=0.333333333333333
#> ar_ineql_alittle_n2: 0.1111111
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_somewht_n2): Data vector has
#> n=2, max-inequality/max-gini=0.333333333333333
#> ar_ineql_somewht_n2: 0.2592593
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_alotine_n2): Data vector has
#> n=2, max-inequality/max-gini=0.333333333333333
#> ar_ineql_alotine_n2: 0.3131313
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_veryvry_n2): Data vector has
#> n=2, max-inequality/max-gini=0.333333333333333
#> ar_ineql_veryvry_n2: 0.3307393
#> 
#> Small N=10 Hard-Code, converge to 9/11=0.8181, see formula above
#> Warning in fv_dist_gini_vector_pos_test(ar_equal_n10): Data vector has n=10,
#> max-inequality/max-gini=0.818181818181818
#> ar_equal_n10: 0
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_some_n10): Data vector has
#> n=10, max-inequality/max-gini=0.818181818181818
#> ar_ineql_some_n10: 0.5395514
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_very_n10): Data vector has
#> n=10, max-inequality/max-gini=0.818181818181818
#> ar_ineql_very_n10: 0.7059554
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_extr_n10): Data vector has
#> n=10, max-inequality/max-gini=0.818181818181818
#> ar_ineql_extr_n10: 0.8181549
#> 
#> UNIFORM Distribution
#> ar_unif_n1000: 0.3265402
#> 
#> NORMAL Distribution
#> ar_norm_lowsd_n1000: 0.005550226
#> ar_norm_highsd_n1000: 0.1153259
#> 
#> BETA Distribution
#> ar_beta_mostpoor_n1000: 0.454737
#> ar_beta_manyrichmanypoor_nomiddle_n1000: 0.4205884
#> ar_beta_mostrich_n1000: 0.09296366
#> 
#> 
#> SHOULD/DOES NOT WORK TEST
#>  ar_unif_n1000_NEGATIVE: -69.42136
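The last line of the output shows that the formula returns a value outside \([0,1]\) once negative values enter the vector, because cumulative shares are no longer between 0 and 1. One way to guard against this is to validate the input first. The wrapper below is a hedged sketch: ffi_gini_checked is a hypothetical name, not a function in REconTools.

```r
# Hypothetical wrapper: validate the input before applying the formula
ffi_gini_checked <- function(ar_x) {
  if (!is.numeric(ar_x) || length(ar_x) == 0) {
    stop("ar_x must be a non-empty numeric vector")
  }
  if (anyNA(ar_x)) stop("ar_x contains missing values")
  if (any(ar_x < 0)) stop("ar_x contains negative values")
  if (sum(ar_x) == 0) stop("ar_x sums to zero, so shares are undefined")
  ar_x <- sort(ar_x)
  it_n <- length(ar_x)
  1 - (2 / (it_n + 1)) * sum(cumsum(ar_x)) / sum(ar_x)
}
# Valid input returns the gini
fl_ok <- ffi_gini_checked(c(1, 2))
# Invalid input raises an error; capture its message for display
st_err <- tryCatch(ffi_gini_checked(c(-1, 2)),
                   error = function(e) conditionMessage(e))
print(fl_ok)
print(st_err)
```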

Gini for Discrete Random Variable

ff_dist_gini_random_var provides the GINI implementation for a discrete random variable. The procedure is the same as before, except that each element of the “x” array now has an element-specific weight associated with it. The function can handle unsorted arrays with non-unique values.
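To illustrate how weights can enter the derivation, the sketch below replaces the bar width \(\frac{1}{N}\) with a probability for each element and divides by the corresponding perfect-equality area. This is a weighted analogue of the steps above, written under that assumption; ffi_gini_weighted is a hypothetical helper and is not claimed to reproduce the internals of ff_dist_gini_random_var.

```r
# Hypothetical helper: weighted analogue of the finite-vector derivation
ffi_gini_weighted <- function(ar_x, ar_prob) {
  # Sort the values and carry the weights along; normalize weights to sum to 1
  ar_order <- order(ar_x)
  ar_x <- ar_x[ar_order]
  ar_prob <- ar_prob[ar_order] / sum(ar_prob)
  # Bar heights: cumulative share of the weighted total
  ar_height <- cumsum(ar_x * ar_prob) / sum(ar_x * ar_prob)
  # Area under the bars, where each bar's width is its probability
  fl_area <- sum(ar_prob * ar_height)
  # Area when all values are equal: heights are cumulative probabilities
  fl_area_equal <- sum(ar_prob * cumsum(ar_prob))
  1 - fl_area / fl_area_equal
}
# With equal weights this reduces to the finite-vector formula
fl_unif <- ffi_gini_weighted(c(2, 1), c(0.5, 0.5))
print(fl_unif)   # 1/9, same as the unweighted gini of (1, 2)
```

With equal weights \(\frac{1}{N}\), the two areas reduce to \(\Gamma\) and \(\Gamma^{\text{equal}}\) from the derivation, so the result matches the unweighted formula.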

Test and compare ff_dist_gini_random_var and ff_dist_gini_vector_pos.

There is a vector of values from 1 to 100, in ascending order. What is the equal-weighted gini, the gini result when smaller numbers have higher weights, and when larger numbers have higher weights?

First, generate the relevant values.

# array
ar_x <- seq(1, 100, length.out = 30)
# prob array
ar_prob_x_unif <- rep.int(1, length(ar_x))/sum(rep.int(1, length(ar_x)))
# prob higher at lower values
ar_prob_x_lowval_highwgt <- rev(cumsum(ar_prob_x_unif))/sum(cumsum(ar_prob_x_unif))
# prob higher at higher values
ar_prob_x_highval_highwgt <- (cumsum(ar_prob_x_unif))/sum(cumsum(ar_prob_x_unif))
# show
print(cbind(ar_x, ar_prob_x_unif, ar_prob_x_lowval_highwgt, ar_prob_x_highval_highwgt))
#>             ar_x ar_prob_x_unif ar_prob_x_lowval_highwgt
#>  [1,]   1.000000     0.03333333              0.064516129
#>  [2,]   4.413793     0.03333333              0.062365591
#>  [3,]   7.827586     0.03333333              0.060215054
#>  [4,]  11.241379     0.03333333              0.058064516
#>  [5,]  14.655172     0.03333333              0.055913978
#>  [6,]  18.068966     0.03333333              0.053763441
#>  [7,]  21.482759     0.03333333              0.051612903
#>  [8,]  24.896552     0.03333333              0.049462366
#>  [9,]  28.310345     0.03333333              0.047311828
#> [10,]  31.724138     0.03333333              0.045161290
#> [11,]  35.137931     0.03333333              0.043010753
#> [12,]  38.551724     0.03333333              0.040860215
#> [13,]  41.965517     0.03333333              0.038709677
#> [14,]  45.379310     0.03333333              0.036559140
#> [15,]  48.793103     0.03333333              0.034408602
#> [16,]  52.206897     0.03333333              0.032258065
#> [17,]  55.620690     0.03333333              0.030107527
#> [18,]  59.034483     0.03333333              0.027956989
#> [19,]  62.448276     0.03333333              0.025806452
#> [20,]  65.862069     0.03333333              0.023655914
#> [21,]  69.275862     0.03333333              0.021505376
#> [22,]  72.689655     0.03333333              0.019354839
#> [23,]  76.103448     0.03333333              0.017204301
#> [24,]  79.517241     0.03333333              0.015053763
#> [25,]  82.931034     0.03333333              0.012903226
#> [26,]  86.344828     0.03333333              0.010752688
#> [27,]  89.758621     0.03333333              0.008602151
#> [28,]  93.172414     0.03333333              0.006451613
#> [29,]  96.586207     0.03333333              0.004301075
#> [30,] 100.000000     0.03333333              0.002150538
#>       ar_prob_x_highval_highwgt
#>  [1,]               0.002150538
#>  [2,]               0.004301075
#>  [3,]               0.006451613
#>  [4,]               0.008602151
#>  [5,]               0.010752688
#>  [6,]               0.012903226
#>  [7,]               0.015053763
#>  [8,]               0.017204301
#>  [9,]               0.019354839
#> [10,]               0.021505376
#> [11,]               0.023655914
#> [12,]               0.025806452
#> [13,]               0.027956989
#> [14,]               0.030107527
#> [15,]               0.032258065
#> [16,]               0.034408602
#> [17,]               0.036559140
#> [18,]               0.038709677
#> [19,]               0.040860215
#> [20,]               0.043010753
#> [21,]               0.045161290
#> [22,]               0.047311828
#> [23,]               0.049462366
#> [24,]               0.051612903
#> [25,]               0.053763441
#> [26,]               0.055913978
#> [27,]               0.058064516
#> [28,]               0.060215054
#> [29,]               0.062365591
#> [30,]               0.064516129

Second, generate GINI values. What should happen?

The ff_dist_gini_random_var and ff_dist_gini_vector_pos results should be the same when the uniform distribution is used.
GINI should be higher, more inequality, if there is higher weight on the lower values.
GINI should be lower, more equality, if there is higher weight on the higher values.

library(REconTools)
ff_dist_gini_vector_pos(ar_x)
#> Warning in ff_dist_gini_vector_pos(ar_x): Data vector has only n=30, max-
#> inequality/min-gini=0.935483870967742
#> [1] 0.3267327
ff_dist_gini_random_var(ar_x, ar_prob_x_unif)
#> [1] 0.3267327
ff_dist_gini_random_var(ar_x, ar_prob_x_lowval_highwgt)
#> [1] 0.4010343
ff_dist_gini_random_var(ar_x, ar_prob_x_highval_highwgt)
#> [1] 0.1926849