vignettes/fv_dist_gini_vector_pos.Rmd
Treat a vector of positive values as a population and calculate the discrete GINI inequality measure. This file works through how the ff_dist_gini_vector_pos function from Fan’s REconTools package works. See also fs_gini_disc from R4Econ, and the ff_dist_gini_random_var function, which computes GINI for discrete random variables.
There is a vector of values (all positive). This could be height information for N individuals, or income information for N individuals. Calculate the GINI coefficient treating the given vector as the population. This is not an estimation exercise where we want to estimate the population GINI from a sample: the given array is the population. The population is discrete and consists only of the \(N\) individuals in the length-\(N\) vector.
See the formula below. Note that when \(N\) is small, the formula defined below caps the inequality it can report, and the cap depends on \(N\). So for small \(N\), we cannot really compare inequality across arrays with different \(N\); we can only compare arrays with the same \(N\).
Given a monotonically increasing array \(X\) with elements \(x_1,...,x_N\):
The GINI formula used here is: \[ GINI = 1 - \frac{2}{N+1} \cdot \left(\sum_{i=1}^N \sum_{j=1}^{i} x_j\right) \cdot \left( \sum_{i=1}^N x_i \right)^{-1} \]
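As a quick worked example of the formula (the vector \(X=(1,2,3)\) is chosen here purely for illustration), the running sums are \(1\), \(3\), and \(6\), which add up to \(10\), and the total is \(6\):
\[ GINI = 1 - \frac{2}{3+1} \cdot 10 \cdot 6^{-1} = 1 - \frac{20}{24} = \frac{1}{6} \approx 0.1667 \]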
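Derive the formula in the steps below. First, compute \(\Gamma\), the area under the discrete Lorenz curve: each of the \(N\) individuals contributes a bar of width \(\frac{1}{N}\) whose height is the cumulative share of the total held by individuals \(1\) through \(i\).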
\[ \Gamma = \sum_{i=1}^N \frac{1}{N} \cdot \left( \sum_{j=1}^{i} \left( \frac{x_j}{\sum_{\widehat{j}=1}^N x_{\widehat{j}} } \right) \right) \]
With perfect equality, \(x_i=a\) for all \(i\). The area under perfect equality is the benchmark that \(\Gamma\) needs to be divided by:
\[ \Gamma^{\text{equal}} = \sum_{i=1}^N \frac{1}{N} \cdot \left( \sum_{j=1}^{i} \left( \frac{a}{\sum_{\widehat{j}=1}^N a } \right) \right) = \frac{N+1}{N}\cdot\frac{1}{2} \]
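The closed form follows because each inner sum collapses to \(\frac{i}{N}\) (the constant \(a\) cancels), leaving the sum of the first \(N\) integers:
\[ \Gamma^{\text{equal}} = \sum_{i=1}^N \frac{1}{N} \cdot \frac{i \cdot a}{N \cdot a} = \frac{1}{N^2} \sum_{i=1}^N i = \frac{1}{N^2} \cdot \frac{N(N+1)}{2} = \frac{N+1}{N}\cdot\frac{1}{2} \]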
As the number of elements of the vector increases: \[ \lim_{N \rightarrow \infty}\Gamma^{\text{equal}} = \lim_{N \rightarrow \infty} \frac{N+1}{N}\cdot\frac{1}{2} = \frac{1}{2} \]
Given the results above, we obtain the GINI formula by dividing \(\Gamma\) by \(\Gamma^{\text{equal}}\), the total area below the 45 degree line, and subtracting the ratio from one.
\[ GINI = 1 - \left(\sum_{i=1}^N \sum_{j=1}^{i} x_j\right) \cdot \left( N \cdot \sum_{i=1}^N x_i \right)^{-1} \cdot \left( \frac{N+1}{N}\cdot\frac{1}{2} \right)^{-1} = 1 - \frac{2}{N+1} \cdot \left(\sum_{i=1}^N \sum_{j=1}^{i} x_j\right) \cdot \left( \sum_{i=1}^N x_i \right)^{-1} \]
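The derivation can be checked numerically. The sketch below is illustrative only: the vector and the variable names are assumptions made for this check, not part of the package. It computes \(\Gamma\) and \(\Gamma^{\text{equal}}\) separately and confirms that \(1 - \Gamma / \Gamma^{\text{equal}}\) agrees with the closed-form expression.

```r
# Illustrative vector (an assumption for this check), sorted ascending
ar_x <- sort(c(1, 2, 3, 5, 8))
it_n <- length(ar_x)

# Gamma: area under the discrete Lorenz curve,
# each bar has width 1/N and height equal to the cumulative share
fl_gamma <- sum((1 / it_n) * (cumsum(ar_x) / sum(ar_x)))

# Gamma under perfect equality: (N+1)/N * 1/2
fl_gamma_equal <- ((it_n + 1) / it_n) * (1 / 2)

# GINI via the ratio of the two areas
fl_gini_ratio <- 1 - fl_gamma / fl_gamma_equal

# GINI via the closed-form formula
fl_gini_closed <- 1 - (2 / (it_n + 1)) * sum(cumsum(ar_x)) / sum(ar_x)

# The two numbers should coincide
print(c(ratio = fl_gini_ratio, closed = fl_gini_closed))
```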
Suppose \(x_i=0\) for all \(i<N\), then:
\[ GINI^{x_i = 0 \text{ except } i=N} = 1 - \frac{2}{N+1} \cdot x_N \cdot \left( x_N \right)^{-1} = 1 - \frac{2}{N+1} \]
\[ \lim_{N \rightarrow \infty} GINI^{x_i = 0 \text{ except } i=N} = 1 - \lim_{N \rightarrow \infty} \frac{2}{N+1} = 1 \]
Note that for small \(N\), for example \(N=10\), even when one person holds all income and all others have 0 income, the formula will not produce a GINI of one; instead the GINI equals \(1-\frac{2}{11}=\frac{9}{11}\approx 0.8182\). If \(N=2\), the GINI is at most \(1-\frac{2}{3}=\frac{1}{3}\approx 0.333\).
\[ MostUnequalGINI\left(N\right) = 1 - \frac{2}{N+1} = \frac{N-1}{N+1} \]
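To see how quickly this cap approaches one, the following sketch evaluates the closed-form formula on the most unequal vector for several values of \(N\) and compares it with \(\frac{N-1}{N+1}\). The helper name ffi_gini_closed and the chosen population sizes are assumptions made for this illustration; they are not part of REconTools.

```r
# Hypothetical helper (not part of REconTools): GINI from the closed-form formula
ffi_gini_closed <- function(ar_x) {
  ar_x <- sort(ar_x)
  it_n <- length(ar_x)
  1 - (2 / (it_n + 1)) * sum(cumsum(ar_x)) / sum(ar_x)
}

# Population sizes to compare (illustrative choices)
ar_it_n <- c(2, 10, 100, 1000)

# Most unequal vector for each N: zeros everywhere except the last element
ar_gini_most_unequal <- sapply(ar_it_n, function(it_n) {
  ffi_gini_closed(c(rep(0, it_n - 1), 1))
})

# Bound from the formula above: (N-1)/(N+1)
ar_gini_bound <- (ar_it_n - 1) / (ar_it_n + 1)

print(cbind(N = ar_it_n, most_unequal = ar_gini_most_unequal, bound = ar_gini_bound))
```

The two columns should agree, which is why comparisons across vectors with different small \(N\) are not meaningful without adjusting for the cap.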
The GINI formula just derived is trivial to compute.
There are no package dependencies. This is the ff_dist_gini_vector_pos function. Define the formula here:
# Clear the workspace; no libraries are needed
rm(list = ls())
# Directly implement the closed-form GINI formula derived above
fv_dist_gini_vector_pos_test <- function(ar_pos) {
  # Check length and give a warning when N is small
  it_n <- length(ar_pos)
  if (it_n <= 100) warning('Data vector has n=',it_n,', max-inequality/max-gini=',(it_n-1)/(it_n + 1))
  # Sort ascending
  ar_pos <- sort(ar_pos)
  # Implement the formula
  fl_gini <- 1 - ((2/(it_n+1)) * sum(cumsum(ar_pos))*(sum(ar_pos))^(-1))
  return(fl_gini)
}
Generate a number of example arrays for testing:
# Example Arrays of data
ar_equal_n1 = c(1)
ar_ineql_n1 = c(100)
ar_equal_n2 = c(1,1)
ar_ineql_alittle_n2 = c(1,2)
ar_ineql_somewht_n2 = c(1,2^3)
ar_ineql_alotine_n2 = c(1,2^5)
ar_ineql_veryvry_n2 = c(1,2^8)
ar_ineql_mostmst_n2 = c(1,2^13)
ar_equal_n10 = c(2,2,2,2,2,2, 2, 2, 2, 2)
ar_ineql_some_n10 = c(1,2,3,5,8,13,21,34,55,89)
ar_ineql_very_n10 = c(1,2^2,3^2,5^2,8^2,13^2,21^2,34^2,55^2,89^2)
ar_ineql_extr_n10 = c(1,2^2,3^3,5^4,8^5,13^6,21^7,34^8,55^9,89^10)
# Uniform draw testing
ar_unif_n1000 = runif(1000, min=0, max=1)
# Normal draw testing
ar_norm_lowsd_n1000 = rnorm(1000, mean=100, sd =1)
ar_norm_lowsd_n1000[ar_norm_lowsd_n1000<0] = 0
ar_norm_highsd_n1000 = rnorm(1000, mean=100, sd =20)
ar_norm_highsd_n1000[ar_norm_highsd_n1000<0] = 0
# Beta draw testing
ar_beta_mostrich_n1000 = rbeta(1000, 5, 1)
ar_beta_mostpoor_n1000 = rbeta(1000, 1, 5)
ar_beta_manyrichmanypoor_nomiddle_n1000 = rbeta(1000, 0.5, 0.5)
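One way to sanity-check the random-draw examples is against a known benchmark: the GINI coefficient of a continuous Uniform(0, 1) distribution is \(\frac{1}{3}\), so a large uniform sample should produce a value near that. The sketch below is an illustration only; the seed, the sample size, and the helper name ffi_gini_closed are assumptions chosen for this example and are not part of REconTools.

```r
# Hypothetical helper (not part of REconTools): closed-form GINI of a vector
ffi_gini_closed <- function(ar_x) {
  ar_x <- sort(ar_x)
  it_n <- length(ar_x)
  1 - (2 / (it_n + 1)) * sum(cumsum(ar_x)) / sum(ar_x)
}

# Fix the seed so the draw is reproducible (arbitrary seed, assumed here)
set.seed(123)
ar_unif_large <- runif(100000, min = 0, max = 1)

# Should be close to the theoretical value of 1/3
fl_gini_unif <- ffi_gini_closed(ar_unif_large)
print(fl_gini_unif)
```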
Now test the example arrays above using the function based on our formula:
#>
#> Small N=1 Hard-Code
#> Warning in fv_dist_gini_vector_pos_test(ar_equal_n1): Data vector has n=1, max-
#> inequality/max-gini=0
#> ar_equal_n1: 0
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_n1): Data vector has n=1, max-
#> inequality/max-gini=0
#> ar_ineql_n1: 0
#>
#> Small N=2 Hard-Code, converge to 1/3, see formula above
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_alittle_n2): Data vector has
#> n=2, max-inequality/max-gini=0.333333333333333
#> ar_ineql_alittle_n2: 0.1111111
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_somewht_n2): Data vector has
#> n=2, max-inequality/max-gini=0.333333333333333
#> ar_ineql_somewht_n2: 0.2592593
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_alotine_n2): Data vector has
#> n=2, max-inequality/max-gini=0.333333333333333
#> ar_ineql_alotine_n2: 0.3131313
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_veryvry_n2): Data vector has
#> n=2, max-inequality/max-gini=0.333333333333333
#> ar_ineql_veryvry_n2: 0.3307393
#>
#> Small N=10 Hard-Code, convege to 9/11=0.8181, see formula above
#> Warning in fv_dist_gini_vector_pos_test(ar_equal_n10): Data vector has n=10,
#> max-inequality/max-gini=0.818181818181818
#> ar_equal_n10: 0
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_some_n10): Data vector has
#> n=10, max-inequality/max-gini=0.818181818181818
#> ar_ineql_some_n10: 0.5395514
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_very_n10): Data vector has
#> n=10, max-inequality/max-gini=0.818181818181818
#> ar_ineql_very_n10: 0.7059554
#> Warning in fv_dist_gini_vector_pos_test(ar_ineql_extr_n10): Data vector has
#> n=10, max-inequality/max-gini=0.818181818181818
#> ar_ineql_extr_n10: 0.8181549
#>
#> UNIFORM Distribution
#> ar_unif_n1000: 0.3265402
#>
#> NORMAL Distribution
#> ar_norm_lowsd_n1000: 0.005550226
#> ar_norm_highsd_n1000: 0.1153259
#>
#> BETA Distribution
#> ar_beta_mostpoor_n1000: 0.454737
#> ar_beta_manyrichmanypoor_nomiddle_n1000: 0.4205884
#> ar_beta_mostrich_n1000: 0.09296366
#>
#>
#> SHOULD/DOES NOT WORK TEST
#> ar_unif_n1000_NEGATIVE: -69.42136
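The last line of the output shows the formula returning a value far outside \([0, 1]\); the input for that test is not shown, but its name suggests it contains negative numbers. A minimal sketch of why negative values break the formula, using a made-up vector and a hypothetical helper name (neither comes from the package): negative entries make the cumulative sums, and possibly the total, negative, so the cumulative "shares" are no longer fractions between zero and one.

```r
# Hypothetical helper (not part of REconTools): closed-form GINI of a vector
ffi_gini_closed <- function(ar_x) {
  ar_x <- sort(ar_x)
  it_n <- length(ar_x)
  1 - (2 / (it_n + 1)) * sum(cumsum(ar_x)) / sum(ar_x)
}

# Made-up vector with one negative entry
ar_with_negative <- c(-5, 1, 2)

# Sorted cumulative sums are -5, -4, -2 (sum -11) and the total is -2,
# so the result is 1 - (2/4) * (-11 / -2) = -1.75
fl_gini_bad <- ffi_gini_closed(ar_with_negative)
print(fl_gini_bad)

# A simple guard one might run before applying the formula
bl_all_nonnegative <- all(ar_with_negative >= 0)
if (!bl_all_nonnegative) message('Input has negative values; the GINI formula does not apply.')
```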
ff_dist_gini_random_var provides the GINI implementation for a discrete random variable. The procedure is the same as before, except that each element of the “x” array now has an element-specific weight (probability) associated with it. The function can handle unsorted arrays with non-unique values.
Test and compare ff_dist_gini_random_var, the GINI implementation for a discrete random variable, against ff_dist_gini_vector_pos.
There is a vector of 30 evenly spaced values from 1 to 100, in ascending order. What is the equal-weighted GINI, the GINI when smaller numbers have higher weights, and the GINI when larger numbers have higher weights?
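To make the role of the weights concrete, here is a minimal sketch of a weighted Lorenz-area calculation. It is an assumption about how such a calculation could be organized, not the source code of ff_dist_gini_random_var; the helper name ffi_gini_weighted_sketch and its arguments are hypothetical. Each bar now has width equal to the element's probability rather than \(\frac{1}{N}\).

```r
# Hypothetical sketch (not the REconTools implementation)
ffi_gini_weighted_sketch <- function(ar_x, ar_prob) {
  # Sort values ascending and carry the probabilities along
  ar_order <- order(ar_x)
  ar_x <- ar_x[ar_order]
  ar_prob <- ar_prob[ar_order] / sum(ar_prob)
  # Cumulative share of the expected value held up to each element
  ar_cum_share <- cumsum(ar_x * ar_prob) / sum(ar_x * ar_prob)
  # Area under the discrete Lorenz curve: width = probability, height = share
  fl_area_lorenz <- sum(ar_prob * ar_cum_share)
  # Area under the perfect-equality benchmark for the same discrete widths
  fl_area_equal <- sum(ar_prob * cumsum(ar_prob))
  1 - fl_area_lorenz / fl_area_equal
}

# With equal weights the sketch reduces to the unweighted closed-form formula
ar_x <- c(8, 1, 3, 2, 5)
it_n <- length(ar_x)
fl_gini_weighted <- ffi_gini_weighted_sketch(ar_x, rep(1 / it_n, it_n))
fl_gini_closed <- 1 - (2 / (it_n + 1)) * sum(cumsum(sort(ar_x))) / sum(ar_x)
print(c(weighted = fl_gini_weighted, closed = fl_gini_closed))
```

With equal probabilities the benchmark area equals \(\frac{N+1}{N}\cdot\frac{1}{2}\), so the weighted sketch and the vector formula give the same number.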
First, generate the relevant values.
# array
ar_x <- seq(1, 100, length.out = 30)
# prob array
ar_prob_x_unif <- rep.int(1, length(ar_x))/sum(rep.int(1, length(ar_x)))
# prob higher at lower values
ar_prob_x_lowval_highwgt <- rev(cumsum(ar_prob_x_unif))/sum(cumsum(ar_prob_x_unif))
# prob higher at higher values
ar_prob_x_highval_highwgt <- (cumsum(ar_prob_x_unif))/sum(cumsum(ar_prob_x_unif))
# show
print(cbind(ar_x, ar_prob_x_unif, ar_prob_x_lowval_highwgt, ar_prob_x_highval_highwgt))
#> ar_x ar_prob_x_unif ar_prob_x_lowval_highwgt
#> [1,] 1.000000 0.03333333 0.064516129
#> [2,] 4.413793 0.03333333 0.062365591
#> [3,] 7.827586 0.03333333 0.060215054
#> [4,] 11.241379 0.03333333 0.058064516
#> [5,] 14.655172 0.03333333 0.055913978
#> [6,] 18.068966 0.03333333 0.053763441
#> [7,] 21.482759 0.03333333 0.051612903
#> [8,] 24.896552 0.03333333 0.049462366
#> [9,] 28.310345 0.03333333 0.047311828
#> [10,] 31.724138 0.03333333 0.045161290
#> [11,] 35.137931 0.03333333 0.043010753
#> [12,] 38.551724 0.03333333 0.040860215
#> [13,] 41.965517 0.03333333 0.038709677
#> [14,] 45.379310 0.03333333 0.036559140
#> [15,] 48.793103 0.03333333 0.034408602
#> [16,] 52.206897 0.03333333 0.032258065
#> [17,] 55.620690 0.03333333 0.030107527
#> [18,] 59.034483 0.03333333 0.027956989
#> [19,] 62.448276 0.03333333 0.025806452
#> [20,] 65.862069 0.03333333 0.023655914
#> [21,] 69.275862 0.03333333 0.021505376
#> [22,] 72.689655 0.03333333 0.019354839
#> [23,] 76.103448 0.03333333 0.017204301
#> [24,] 79.517241 0.03333333 0.015053763
#> [25,] 82.931034 0.03333333 0.012903226
#> [26,] 86.344828 0.03333333 0.010752688
#> [27,] 89.758621 0.03333333 0.008602151
#> [28,] 93.172414 0.03333333 0.006451613
#> [29,] 96.586207 0.03333333 0.004301075
#> [30,] 100.000000 0.03333333 0.002150538
#> ar_prob_x_highval_highwgt
#> [1,] 0.002150538
#> [2,] 0.004301075
#> [3,] 0.006451613
#> [4,] 0.008602151
#> [5,] 0.010752688
#> [6,] 0.012903226
#> [7,] 0.015053763
#> [8,] 0.017204301
#> [9,] 0.019354839
#> [10,] 0.021505376
#> [11,] 0.023655914
#> [12,] 0.025806452
#> [13,] 0.027956989
#> [14,] 0.030107527
#> [15,] 0.032258065
#> [16,] 0.034408602
#> [17,] 0.036559140
#> [18,] 0.038709677
#> [19,] 0.040860215
#> [20,] 0.043010753
#> [21,] 0.045161290
#> [22,] 0.047311828
#> [23,] 0.049462366
#> [24,] 0.051612903
#> [25,] 0.053763441
#> [26,] 0.055913978
#> [27,] 0.058064516
#> [28,] 0.060215054
#> [29,] 0.062365591
#> [30,] 0.064516129
Second, generate GINI values. What should happen?
library(REconTools)
ff_dist_gini_vector_pos(ar_x)
#> Warning in ff_dist_gini_vector_pos(ar_x): Data vector has only n=30, max-
#> inequality/min-gini=0.935483870967742
#> [1] 0.3267327
ff_dist_gini_random_var(ar_x, ar_prob_x_unif)
#> [1] 0.3267327
ff_dist_gini_random_var(ar_x, ar_prob_x_lowval_highwgt)
#> [1] 0.4010343
ff_dist_gini_random_var(ar_x, ar_prob_x_highval_highwgt)
#> [1] 0.1926849