Chapter 5 Linear Regression

5.1 Linear and Polynomial Fitting

5.1.1 Fit Curves Through Points


5.1.1.1 Polynomial Fit with N Sets of Points

There are three points defined by their x and y coordinates. We draw these points randomly and find a curve that fits through them exactly. If there are more than three points, a quadratic will generally not fit them exactly. In the illustration below, we first draw 3 pairs of x and y points and fit them exactly with a quadratic, then we draw 4 pairs and compare the prediction results under quadratic and cubic fits.

Note that we are not generating data from a particular set of quadratic (polynomial) parameters; we are just drawing random values. When we draw exactly three pairs of random x and y values, we can find a quadratic polynomial that fits exactly through the three points in 2D space.
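
As a minimal sketch of this point (the variable names below are illustrative and not part of the function defined next), exactly three points pin down the quadratic coefficients through a 3-by-3 Vandermonde linear system:

set.seed(123)
ar_x <- rnorm(3)
ar_y <- rnorm(3)
# Vandermonde matrix with columns 1, x, x^2
mt_vander <- cbind(1, ar_x, ar_x^2)
# unique exact solution as long as the three x values are distinct
ar_coef <- solve(mt_vander, ar_y)
# the implied quadratic passes through all three points
all.equal(as.numeric(mt_vander %*% ar_coef), ar_y)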

We define the function here.

#' Test polynomial fit to random draw x and y points
#'
#' @description Draw random sets of x and y points, fit a polynomial curve through
#' and compare predictions of y and actual y values.
#'
#' @param it_xy_pairs The number of x and y pair points
#' @param it_seed The random seed for drawing values
#' @param it_poly_fit An integer value of the order of polynomial
#' @param fl_mean The mean of the random normal draw
#' @param fl_sd The standard deviation of the random normal draw
#' @returns
#' \itemize{
#'   \item rs_lm_poly - polynomial fit estimation results
#'   \item df_data_predict - N by 4 where N = \code{it_xy_pairs} and
#' columns are x, y, y-predict, and residual
#'   \item td_table_return - Display version of df_data_predict with title
#' }
#' @import stats tibble dplyr
#' @author Fan Wang, \url{http://fanwangecon.github.io}
ffi_lm_quad_fit <- function(it_xy_pairs = 3, it_seed = 123,
                            it_poly_fit = 2, fl_mean = 1, fl_sd = 1,
                            verbose = FALSE) {

  # 1. Generate three pairs of random numbers
  set.seed(it_seed)
  mt_rnorm <- matrix(
    rnorm(it_xy_pairs * 2, mean = fl_mean, sd = fl_sd),
    nrow = it_xy_pairs, ncol = 2
  )
  colnames(mt_rnorm) <- c("x", "y")
  rownames(mt_rnorm) <- paste0("p", seq(1, it_xy_pairs))
  df_rnorm <- as_tibble(mt_rnorm)

  # 2. Quadratic fit using ORTHOGONAL POLYNOMIAL
  # For predictions, lm(y ~ x + I(x^2)) and lm(y ~ poly(x, 2)) are the same,
  # but they have different parameters because x is transformed by poly().
  rs_lm_quad <- stats::lm(y ~ poly(x, it_poly_fit), data = df_rnorm)
  if (verbose) print(stats::summary.lm(rs_lm_quad))

  # 3. Fit prediction
  ar_y_predict <- stats::predict(rs_lm_quad)
  df_data_predict <- cbind(df_rnorm, ar_y_predict) %>%
    mutate(res = ar_y_predict - y)
  if (verbose) print(df_data_predict)

  # 4. show values
  st_poly_order <- "Quadratic"
  if (it_poly_fit != 2) {
    # label non-quadratic fits with the proper ordinal suffix (e.g. 3rd, 4th)
    st_suffix <- switch(as.character(it_poly_fit),
      "1" = "st", "2" = "nd", "3" = "rd", "th"
    )
    st_poly_order <- paste0(it_poly_fit, st_suffix, " order")
  }
  td_table_return <- kable(df_data_predict,
    caption = paste0(
      st_poly_order, " Fit of ", it_xy_pairs, " Sets of Random (X,Y) Points"
    )
  ) %>%
    kable_styling_fc()

  return(list(
    rs_lm_quad = rs_lm_quad,
    df_data_predict = df_data_predict,
    td_table_return = td_table_return
  ))
}

In the first example below, we simulate 3 sets of points and estimate an exact quadratic fit.

ls_ffi_lm_quad_fit <-
  ffi_lm_quad_fit(
    it_xy_pairs = 3, it_seed = 123,
    it_poly_fit = 2, fl_mean = 1, fl_sd = 1
  )
ls_ffi_lm_quad_fit$td_table_return
Table 5.1: Quadratic Fit of 3 Sets of Random (X,Y) Points
x y ar_y_predict res
0.4395244 1.070508 1.070508 0
0.7698225 1.129288 1.129288 0
2.5587083 2.715065 2.715065 0
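
As a quick side check of the comment inside the function above (a sketch with assumed variable names, relying on the tidyverse being loaded as elsewhere in this chapter), raw and orthogonal polynomial fits have different coefficients but identical predictions:

set.seed(123)
df_chk <- tibble(x = rnorm(10), y = rnorm(10))
rs_raw <- lm(y ~ x + I(x^2), data = df_chk)
rs_orth <- lm(y ~ poly(x, 2), data = df_chk)
# coefficients differ because poly() orthogonalizes the regressors
print(coef(rs_raw))
print(coef(rs_orth))
# but fitted values agree up to numerical precision
all.equal(unname(fitted(rs_raw)), unname(fitted(rs_orth)))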

In the second example below, we simulate 4 sets of points and estimate a quadratic fit, which is no longer exact.

ls_ffi_lm_quad_fit <-
  ffi_lm_quad_fit(
    it_xy_pairs = 4, it_seed = 345,
    it_poly_fit = 2, fl_mean = 1, fl_sd = 1
  )
ls_ffi_lm_quad_fit$td_table_return
Table 5.2: Quadratic Fit of 4 Sets of Random (X,Y) Points
x y ar_y_predict res
0.2150918 0.9324684 0.9373687 0.0049003
0.7204856 0.3664796 1.5207575 1.1542779
0.8385421 0.0722760 -0.0080226 -0.0802987
0.7094034 2.7107710 1.6318915 -1.0788795

In the third example below, we simulate the same 4 sets of points as in the prior example, but now use a cubic polynomial to fit the data exactly.

ls_ffi_lm_cubic_fit <-
  ffi_lm_quad_fit(
    it_xy_pairs = 4, it_seed = 345,
    it_poly_fit = 3, fl_mean = 1, fl_sd = 1
  )
ls_ffi_lm_cubic_fit$td_table_return
Table 5.3: 3rd order Fit of 4 Sets of Random (X,Y) Points
x y ar_y_predict res
0.2150918 0.9324684 0.9324684 0
0.7204856 0.3664796 0.3664796 0
0.8385421 0.0722760 0.0722760 0
0.7094034 2.7107710 2.7107710 0

5.1.2 Polynomial Time Series


5.1.2.1 Analytical Solution for Time-Series Polynomial Coefficients

This file is developed in support of Bhalotra, Fernandez, and Wang (2021), where we use polynomials to model share parameter changes over time in the context of a nested-CES problem.

5.1.2.1.1 Mth Order Polynomial and its Derivative

We have a polynomial of Mth order:

\[ y\left(t\right) = a_0 + a_1 \cdot t + a_2 \cdot t^2 + ... + a_M \cdot t^M \]

Taking derivatives of \(y\) with respect to \(t\) repeatedly, we have:

\[ \begin{aligned} \frac{d^{M-0}y}{dt^{M-0}} &= a_{M-0} \cdot \frac{\left(M-0\right)!}{0!} \cdot t^{-0+0} \\ \frac{d^{M-1}y}{dt^{M-1}} &= a_{M-0} \cdot \frac{\left(M-0\right)!}{1!} \cdot t^{-0+1} + a_{M-1} \cdot \frac{\left(M-1\right)!}{0!} \cdot t^{-1+1} \\ \frac{d^{M-2}y}{dt^{M-2}} &= a_{M-0} \cdot \frac{\left(M-0\right)!}{2!} \cdot t^{-0+2} + a_{M-1} \cdot \frac{\left(M-1\right)!}{1!} \cdot t^{-1+2} + a_{M-2} \cdot \frac{\left(M-2\right)!}{0!} \cdot t^{-2+2} \\ \frac{d^{M-3}y}{dt^{M-3}} &= a_{M-0} \cdot \frac{\left(M-0\right)!}{3!} \cdot t^{-0+3} + a_{M-1} \cdot \frac{\left(M-1\right)!}{2!} \cdot t^{-1+3} + a_{M-2} \cdot \frac{\left(M-2\right)!}{1!} \cdot t^{-2+3} + a_{M-3} \cdot \frac{\left(M-3\right)!}{0!} \cdot t^{-3+3} \end{aligned} \]

Given the structure above, we have the following formulation for polynomial derivative, where we have a polynomial of \(M^{th}\) order and we are interested in the \(N^{th}\) derivative:

\[ \begin{aligned} \frac{d^{N}y(M)}{dt^{N}} &= \sum_{i=0}^{M-N} \left( a_{M-i} \cdot \frac{\left(M-i\right)!}{\left(M-N-i\right)!} \cdot t^{M-N-i} \right) \end{aligned} \]

Note that the \(M^{th}\) derivative of an \(M^{th}\) order polynomial is the constant \(a_M \cdot M!\).
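
As a small numeric check of the derivative formula above (a sketch; the object names and the use of base R's D() for repeated symbolic differentiation are assumptions for illustration):

it_M <- 4   # polynomial order
it_N <- 2   # order of the derivative
set.seed(456)
ar_a <- rnorm(it_M + 1) # coefficients a_0, ..., a_M
fl_t <- 1.5             # point at which to evaluate the derivative
# closed-form sum: a_(M-i) * (M-i)!/(M-N-i)! * t^(M-N-i), for i = 0, ..., M-N
ar_i <- 0:(it_M - it_N)
fl_deriv_formula <- sum(
  ar_a[it_M - ar_i + 1] *
    factorial(it_M - ar_i) / factorial(it_M - it_N - ar_i) *
    fl_t^(it_M - it_N - ar_i)
)
# the same derivative by repeatedly differentiating the polynomial with D()
ex_poly <- parse(text = paste0(
  ar_a[1], " + ", paste0(ar_a[-1], "*t^", 1:it_M, collapse = " + ")
))[[1]]
ex_deriv <- ex_poly
for (i in seq_len(it_N)) ex_deriv <- D(ex_deriv, "t")
fl_deriv_symbolic <- eval(ex_deriv, list(t = fl_t))
all.equal(fl_deriv_formula, fl_deriv_symbolic) # TRUE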

5.1.2.1.2 Repeatedly Taking Differences of Differences

Suppose the data generating process is an \(M^{th}\) order polynomial. Then differencing the time-series observables can be used to identify the polynomial coefficients.

In the simplest case of a 1st order polynomial with \(Y_t = a_0 + a_1 \cdot t\), given 2 periods of data at \(t=0\) and \(t=1\), we have \(a_1 = Y_1 - Y_0\) and \(a_0 = Y_0\). The rate of change in \(Y\) over \(t\) captures coefficient \(a_1\).

For a second order polynomial, while the first derivative is time-varying, the second derivative, acceleration, is invariant over \(t\). Similarly, for any \(M^{th}\) order polynomial, the \(M^{th}\) derivative is time-invariant.

Because of this time-invariance of the \(M^{th}\) derivative, the differencing idea can be used to identify \(a_M\), the coefficient for the highest polynomial term for the \(M^{th}\) order polynomial.

Now we difference the observable \(y_t\) over time: we difference the differences, difference the differences of the differences, and so on. It turns out that the \(M^{th}\) difference is a sum over the observations \(\left\{y_t\right\}_{t=1}^{T}\), with the weight on each \(y_t\) term following Pascal’s Triangle.

Specifically:

  • 1st difference is \(y_1 - y_0\)
  • 2nd difference is \((y_2-y_1) - (y_1-y_0) = y_2 - 2y_1 + y_0\)
  • 3rd difference is \(((y_3-y_2) - (y_2-y_1)) - ((y_2-y_1) - (y_1-y_0)) = y_3 - 3y_2 + 3y_1 - y_0\)
  • 4th difference is \((((y_4 - y_3) - (y_3 - y_2)) - ((y_3-y_2) - (y_2-y_1))) - (((y_3-y_2) - (y_2-y_1)) - ((y_2-y_1) - (y_1-y_0))) = y_4 - 4y_3 + 6y_2 - 4y_1 + y_0\)
  • 5th difference is \(((((y_5-y_4) - (y_4-y_3)) - ((y_4-y_3) - (y_3-y_2))) - (((y_4-y_3) - (y_3-y_2)) - ((y_3-y_2) - (y_2-y_1)))) - ((((y_4-y_3) - (y_3-y_2)) - ((y_3-y_2) - (y_2-y_1))) - (((y_3-y_2) - (y_2-y_1)) - ((y_2-y_1) - (y_1-y_0)))) = y_5 - 5y_4 + 10y_3 - 10y_2 + 5y_1 - y_0\)

Note that the pattern has alternating signs, and the coefficients are binomial. We have the following formula:

\[ \begin{aligned} \Delta^M &= \sum_{i=0}^{M} \left(\left(-1\right)^i\cdot\frac{M!}{\left(M-i\right)!i!}\right) \cdot y_{\left(M-i\right)} \\ \end{aligned} \]
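
As a minimal check of this formula (a sketch with illustrative variable names), base R's diff() applied repeatedly reproduces the alternating binomial weights:

set.seed(123)
ar_y <- rnorm(6) # y_0, ..., y_5
# 5th difference via repeated differencing
fl_diff_5 <- diff(ar_y, differences = 5)
# same value from the closed-form binomial formula
ar_weights <- (-1)^(0:5) * choose(5, 0:5) # 1, -5, 10, -10, 5, -1
fl_formula_5 <- sum(ar_weights * rev(ar_y))
all.equal(as.numeric(fl_diff_5), fl_formula_5) # TRUE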

When there are \(T\) periods of data and we are interested in the \((T-1)^{th}\) difference, the differencing formula is unique. For differences of order less than \(T-1\), however, we can use alternative consecutive data segments. Specifically, given \(T\) periods of data from \(t=1\) to \(t=T\), we write \(\Delta^M_{\tau}\), where \(\tau\) is the starting time. We have, for \(M \le T-1\):

\[ \begin{aligned} \Delta^M_{\tau}\left(\left\{y_t\right\}_{t=1}^{T}\right) &= \sum_{i=0}^{M} \left(\left(-1\right)^i\cdot\frac{M!}{\left(M-i\right)!i!}\right) \cdot y_{\left(\tau + \left(M-i\right)\right)} \thinspace\thinspace\text{for}\thinspace\thinspace 1 \le \tau \le T-M \\ \end{aligned} \]

5.1.2.1.3 Solutions for Polynomial Coefficients with Differences of Differences

Intuitively, for an \(M^{th}\) order polynomial, the coefficient on the highest polynomial term is a function of the \(M^{th}\) difference. Coefficients on lower polynomial terms, \(m<M\), are functions of the \(m^{th}\) difference along with the higher order polynomial coefficients already computed.

Formally, for an \(M^{th}\) order polynomial, we have a vector of \(M+1\) polynomial coefficients \(\left\{a_m\right\}_{m=0}^{M}\). The formula for the coefficient on the highest order term is:

\[ \begin{aligned} a_M &= \sum_{i=0}^{M} \left( \left(-1\right)^i \left( \left(M-i\right)! i! \right)^{-1} \right) \cdot y_{\left(M-i\right)} \end{aligned} \]

Given this, and given \(T\) periods of data from \(t=1\) to \(t=T\), we also have: \[ \begin{aligned} a_{M-1} &= \sum_{i=0}^{M-1} \left( \left(-1\right)^i \left( \left(M-1-i\right)! i! \right)^{-1} \right) \cdot \left( y_{\left(\tau + \left(M-1-i\right)\right)} - a_M \cdot \left(\tau + \left(M-1-i\right)\right)^M \right) \thinspace\thinspace \text{for} \thinspace\thinspace 1 \le \tau \le T-M+1 \end{aligned} \]

Combining these into one formula: given the already computed higher order coefficients \(a_{m+1}, \ldots, a_{M}\), we have:

\[ \begin{aligned} a_{m} &= \sum_{i=0}^{m} \left( \left(-1\right)^i \left( \left(m-i\right)! i! \right)^{-1} \right) \cdot \left( y_{\left(\tau + \left(m-i\right)\right)} - \sum_{j=0}^{M-m-1} a_{M-j} \cdot \left(\tau + \left(m-i\right)\right)^{M-j} \right) \thinspace\thinspace \text{for} \thinspace\thinspace 1 \le \tau \le T-m \end{aligned} \]
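
The recursion above can be written compactly in code. The sketch below (illustrative, with assumed function and variable names; it fixes \(\tau\) at the first available window and assumes exact data on a unit time grid starting at \(t=0\)) recovers all coefficients of an \(M^{th}\) order polynomial from \(M+1\) observations:

ffi_poly_coef_from_diff <- function(ar_y) {
  it_M <- length(ar_y) - 1
  ar_t <- 0:it_M
  ar_a <- rep(NA_real_, it_M + 1) # ar_a[m + 1] stores a_m
  ar_y_hat <- ar_y
  for (it_m in seq(it_M, 1)) {
    # the mth difference of the adjusted series, divided by m!, gives a_m;
    # the weights are the alternating binomial coefficients
    ar_w <- (-1)^(0:it_m) * choose(it_m, 0:it_m)
    ar_a[it_m + 1] <- sum(ar_w * rev(ar_y_hat[1:(it_m + 1)])) / factorial(it_m)
    # subtract the newly identified term before taking the next, lower difference
    ar_y_hat <- ar_y_hat - ar_a[it_m + 1] * ar_t^it_m
  }
  ar_a[1] <- ar_y_hat[1] # a_0 is what remains at t = 0
  return(ar_a)
}
# quick check with a known cubic
ar_a_true <- c(-0.5, 0.3, 1.2, 0.7)
ar_y_test <- sapply(0:3, function(t) sum(ar_a_true * t^(0:3)))
ffi_poly_coef_from_diff(ar_y_test) # recovers -0.5, 0.3, 1.2, 0.7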

5.1.2.1.4 Identifying Polynomial Coefficients with Differences for Third Order Polynomial

To illustrate, we test the formulas with a 3rd order polynomial, and derive some 3rd-order specific formulas.

For data from a 3rd order polynomial data generating process, we can use the 3rd difference to identify the coefficient on \(t^3\). With that in hand, we can iterate down to the lower order terms and identify all remaining coefficients.

Specifically, using equations from the two sections above, we have:

\[ \begin{aligned} \frac{d^{3}y(3)}{dt^{3}} &= y_3 - 3y_2 + 3y_1 - y_0 \\ \frac{d^{3}y(3)}{dt^{3}} &= 3! \cdot a_3 \end{aligned} \]

Combining the two equations, we have that \(a_3\) is the 3rd difference divided by \(6\):

\[ \begin{aligned} 3! \cdot a_3 &= y_3 - 3y_2 + 3y_1 - y_0\\ a_3 &= \frac{y_3 - 3y_2 + 3y_1 - y_0 }{3\cdot2} \end{aligned} \]

For the quadratic and linear terms, we have:

\[ \begin{aligned} \frac{d^{2}y(M=3)}{dt^{2}} &= 2 \cdot a_2 + 6 \cdot a_3 \cdot t \\ \frac{d^{1}y(M=3)}{dt^{1}} &= a_1 + 2 \cdot a_2 \cdot t + 3 \cdot a_3 \cdot t^2 \end{aligned} \]

Note that the 3rd derivative of a 3rd order polynomial is a constant, but the 2nd derivative of a 3rd order polynomial is not. This means that to use the second difference to identify the \(a_2\) parameter, we first have to difference out from \(y_t\) the contribution of the cubic term, which we can do because we now know \(a_3\).

Differencing out the cubic term, we now have the 2nd derivative of a 2nd order polynomial: \[ \begin{aligned} \frac{d^{2}\left(y(M=3) - a_3\cdot t^3\right)}{dt^{2}} &= \frac{d^{2}\left(\widehat{y}(3,2)\right)}{dt^{2}} \thinspace\thinspace, \\ \end{aligned} \] where \(\widehat{y}\left(3,2\right)=y\left(M=3\right)-a_3\cdot t^3\).

So this means we have:

\[ \begin{aligned} \frac{d^{2}\widehat{y}(3,2)}{dt^{2}} &= \left(1 \cdot y_2 - 2 \cdot y_1 + 1 \cdot y_0\right) - a_3\cdot\left(1\cdot 2^3 - 2 \cdot 1^3 + 1 \cdot 0^3 \right) \\ &= \left(1 \cdot y_2 - 2 \cdot y_1 + 1 \cdot y_0\right) - a_3\cdot \left(2^3 - 2\right) \\ \frac{d^{2}\widehat{y}(3,2)}{dt^{2}} &= 2! \cdot a_2 \end{aligned} \]

Given the value for \(a_3\), we have:

\[ \begin{aligned} 2! \cdot a_2 &= \left(1 \cdot y_2 - 2 \cdot y_1 + 1 \cdot y_0\right) - a_3\cdot 6 \\ 2! \cdot a_2 &= \left(1 \cdot y_2 - 2 \cdot y_1 + 1 \cdot y_0\right) - \frac{y_3 - 3y_2 + 3y_1 - y_0 }{3\cdot2} \cdot \left(2^3 - 2\right) \\ 2! \cdot a_2 &= \left(y_2 - 2y_1 + y_0\right) - \left(y_3 - 3y_2 + 3y_1 - y_0\right) \\ 2! \cdot a_2 & = - y_3 + 4y_2 - 5y_1 + 2y_0 \\ a_2 & = \frac{- y_3 + 4y_2 - 5y_1 + 2y_0}{2} \end{aligned} \]

Following the same strategy, we can also find \(a_1\). Let \(\widehat{y}\left(3,1\right)=y\left(M=3\right)-a_3\cdot t^3-a_2\cdot t^2\).

\[ \begin{aligned} \frac{d^{1}\widehat{y}(3,1)}{dt^{1}} &= \left(y_1 - y_0\right) - a_3\cdot\left(1^3 - 0^3 \right) - a_2\cdot\left(1^2 - 0^2 \right) \\ &= \left(y_1 - y_0\right) - a_3 - a_2 \\ \frac{d^{1}\widehat{y}(3,1)}{dt^{1}} &= 1! \cdot a_1 \end{aligned} \]

Hence:

\[ \begin{aligned} a_1 &= \left(y_1 - y_0\right) - a_3\cdot\left(1^3 - 0^3 \right) - a_2\cdot\left(1^2 - 0^2 \right) \\ &= \left(y_1 - y_0\right) - a_3 - a_2 \\ &= \left(y_1 - y_0\right) - \frac{y_3 - 3y_2 + 3y_1 - y_0 }{3\cdot2} - \frac{- y_3 + 4y_2 - 5y_1 + 2y_0}{2} \\ &= \frac{6y_1 - 6y_0}{6} - \frac{y_3 - 3y_2 + 3y_1 - y_0 }{6} - \frac{-3y_3 + 12y_2 - 15y_1 + 6y_0}{6} \\ &= \frac{6y_1 - 6y_0 -y_3 + 3y_2 - 3y_1 + y_0 +3y_3 - 12y_2 + 15y_1 - 6y_0}{6} \\ &= \frac{2y_3 - 9y_2 + 18y_1 - 11y_0}{6} \end{aligned} \]

Finally, we know that \(a_0 = y_0\). We now have analytical expressions for each of the 4 polynomial coefficients of a 3rd order polynomial. Given four periods of data at \(t=0,1,2,3\) generated by the process, these expressions back out the underlying parameters of the data generating process.

5.1.2.1.5 Third Order Polynomial Simulation and Solving for Parameters

Now we generate a time series of values and solve back for the underlying polynomial coefficients.

# polynomial coefficients
set.seed(123)
ar_coef_poly <- rnorm(4)
# time right hand side matrix
ar_t <- 0:3
ar_power <- 0:3
mt_t_data <- do.call(rbind, lapply(ar_power, function(power) {
  ar_t^power
}))
# Final matrix, each row is an observation, or time.
mt_t_data <- t(mt_t_data)
# General model prediction
ar_y <- mt_t_data %*% matrix(ar_coef_poly, ncol = 1, nrow = 4)
# Prediction and Input time matrix
mt_all_data <- cbind(ar_y, mt_t_data)
st_cap <- paste0(
  "C1=Y, each row is time, t=0, incremental by 1, ",
  "each column a polynomial term from 0th to higher."
)
kable(mt_all_data, caption = st_cap) %>% kable_styling_fc()
Table 5.4: C1=Y, each row is time, t=0, incremental by 1, each column a polynomial term from 0th to higher.
-0.5604756 1 0 0 0
0.8385636 1 1 1 1
5.7780698 1 2 4 8
14.6810933 1 3 9 27

Backing out coefficients using the formulas for 3rd order polynomial derived above, we have:

# The constant term
alpha_0 <- ar_y[1]
# The cubic term
alpha_3 <- as.numeric((t(ar_y) %*% c(-1, +3, -3, +1))/(3*2))
# The quadratic term, difference cubic out, alpha_2_1t3 = alpha_2_2t4
ar_y_hat <- ar_y - alpha_3*ar_t^3
alpha_2_1t3 <- as.numeric((t(ar_y_hat[1:3]) %*% c(1, -2, +1))/(2))
alpha_2_2t4 <- as.numeric((t(ar_y_hat[2:4]) %*% c(1, -2, +1))/(2))
alpha_2 <- alpha_2_1t3
# The linear term, difference cubic out and quadratic
ar_y_hat <- ar_y - alpha_3*ar_t^3 - alpha_2*ar_t^2
alpha_1_1t2 <- as.numeric((t(ar_y_hat[1:2]) %*% c(-1, +1))/(1))
alpha_1_2t3 <- as.numeric((t(ar_y_hat[2:3]) %*% c(-1, +1))/(1))
alpha_1_3t4 <- as.numeric((t(ar_y_hat[3:4]) %*% c(-1, +1))/(1))
alpha_1 <- alpha_1_1t2
# Collect results
ar_names <- c("Constant", "Linear", "Quadratic", "Cubic")
ar_alpha_solved <- c(alpha_0, alpha_1, alpha_2, alpha_3)
mt_alpha <- cbind(ar_names, ar_alpha_solved, ar_coef_poly)
# Display
ar_st_varnames <- c('Coefficient Counter', 'Polynomial Terms', 'Solved Coefficient Given Y', 'Actual DGP Coefficient')
tb_alpha <- as_tibble(mt_alpha) %>%
  rowid_to_column(var = "polynomial_term_coef") %>%
  rename_all(~c(ar_st_varnames))
# Display
st_cap = paste0('Solving for polynomial coefficients.')
kable(tb_alpha, caption = st_cap) %>% kable_styling_fc()
Table 5.5: Solving for polynomial coefficients.
Coefficient Counter Polynomial Terms Solved Coefficient Given Y Actual DGP Coefficient
1 Constant -0.560475646552213 -0.560475646552213
2 Linear -0.230177489483281 -0.23017748948328
3 Quadratic 1.55870831414913 1.55870831414912
4 Cubic 0.0705083914245757 0.070508391424576

Note that because the data are exact output from the DGP, and we have as many data points as parameters, the parameters are exactly identified. However, this is only strictly true for the \(a_3\) parameter, which requires all four periods of data.

  • \(a_2\) is over-identified: given \(a_3\), we can arrive at it with the difference of differences using either the first three observations (\(t=0,1,2\)) or the last three (\(t=1,2,3\)).
  • \(a_1\) is also over-identified given \(a_3\) and \(a_2\): the difference over \(t=0,1\), \(t=1,2\), or \(t=2,3\) can each identify \(a_1\).

Note also that the solution above can be found by running a linear regression. The point of doing the exercise here, and of showing analytically how layers of differences of differences identify each polynomial coefficient, is to show which variation in the underlying data identifies each polynomial term.
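
As a minimal check of that claim (a sketch; it assumes the objects ar_y, ar_t, and ar_coef_poly from the simulation block above are still in memory), regressing the simulated y on the raw polynomial terms in t recovers the same coefficients:

rs_lm_cubic <- lm(as.numeric(ar_y) ~ ar_t + I(ar_t^2) + I(ar_t^3))
# with four observations and four coefficients the fit is exact, so the
# estimates coincide with the DGP coefficients up to numerical precision
print(cbind(lm_estimate = coef(rs_lm_cubic), dgp_coefficient = ar_coef_poly))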

In effect, all identification is based on the fact that the \(M^{th}\) order polynomial’s \(M^{th}\) derivative is a constant: it is invariant over \(t\). This is the core assumption, or restriction, of the otherwise highly flexible polynomial functional form. With this invariance of the highest-order derivative, all other parameters are obtained through the simple act of differencing.

5.2 OLS and IV


5.2.1 OLS and IV Regression


IV regression using the AER package, with the option to store all results in a single dataframe row so that results from multiple estimations can be combined. The function also produces row statistics.

5.2.1.1 Construct Program

# IV regression function
# The code below uses the AER library's ivreg() regression function
# All results are stored in a single row as a data frame
# This function can be used with dplyr's do
# var.y is a single outcome; vars.x, vars.c and vars.z are vectors of endogenous variables, controls and instruments.
regf.iv <- function(var.y, vars.x, 
                    vars.c, vars.z, df, transpose=TRUE) {
  
  # A. Set-Up Equation
  str.vars.x <- paste(vars.x, collapse='+')
  str.vars.c <- paste(vars.c, collapse='+')
  
  df <- df %>% 
    select(one_of(var.y, vars.x, vars.c, vars.z)) %>% 
    drop_na() %>% filter_all(all_vars(!is.infinite(.)))
  
  if (length(vars.z) >= 1) {
    #     library(AER)
    str.vars.z <- paste(vars.z, collapse='+')
    equa.iv <- paste(var.y,
                     paste(paste(str.vars.x, str.vars.c, sep='+'),
                           paste(str.vars.z, str.vars.c, sep='+'),
                           sep='|'),
                     sep='~')
    #     print(equa.iv)
    
    # B. IV Regression
    ivreg.summ <- summary(ivreg(as.formula(equa.iv), data=df),
                          vcov = sandwich, df = Inf, diagnostics = TRUE)
    
    # C. Statistics from IV Regression
    #     ivreg.summ$coef
    #     ivreg.summ$diagnostics
    
    # D. Combine Regression Results into a Matrix
    df.results <- suppressWarnings(suppressMessages(
      as_tibble(ivreg.summ$coef, rownames='rownames') %>%
        full_join(as_tibble(ivreg.summ$diagnostics, rownames='rownames')) %>%
        full_join(tibble(rownames=c('vars'),
                         var.y=var.y,
                         vars.x=str.vars.x,
                         vars.z=str.vars.z,
                         vars.c=str.vars.c))))
  } else {
    
    # OLS regression
    equa.ols <- paste(var.y,
                      paste(paste(vars.x, collapse='+'),
                            paste(vars.c, collapse='+'), sep='+'),
                      sep='~')
    
    lmreg.summ <- summary(lm(as.formula(equa.ols), data=df))
    
    lm.diagnostics <- as_tibble(
      list(df1=lmreg.summ$df[[1]],
           df2=lmreg.summ$df[[2]],
           df3=lmreg.summ$df[[3]],
           sigma=lmreg.summ$sigma,
           r.squared=lmreg.summ$r.squared,
           adj.r.squared=lmreg.summ$adj.r.squared)) %>%
      gather(variable, value) %>%
      rename(rownames = variable) %>%
      rename(v = value)
    
    df.results <- suppressWarnings(suppressMessages(
      as_tibble(lmreg.summ$coef, rownames='rownames') %>%
        full_join(lm.diagnostics) %>%
        full_join(tibble(rownames=c('vars'),
                         var.y=var.y,
                         vars.x=str.vars.x,
                         vars.c=str.vars.c))))
  }
  
  # E. Flatten Matrix, All IV results as a single tibble 
  # row to be combined with other IV results
  df.row.results <- df.results %>%
    gather(variable, value, -rownames) %>%
    drop_na() %>%
    unite(esti.val, rownames, variable) %>%
    mutate(esti.val = gsub(' ', '', esti.val))
  
  if (transpose) {
    df.row.results <- df.row.results %>% spread(esti.val, value)
  }
  
  # F. Return
  return(data.frame(df.row.results))
}

5.2.1.2 Program Testing

Load Data

# Library
library(tidyverse)
library(AER)

# Load Sample Data
setwd('C:/Users/fan/R4Econ/_data/')
df <- read_csv('height_weight.csv')
5.2.1.2.1 Example No Instrument, OLS
# No instruments, estimate by OLS
var.y <- c('hgt')
vars.x <- c('prot')
vars.z <- NULL
vars.c <- c('sex', 'hgt0', 'wgt0')
# Regression
regf.iv(var.y, vars.x, vars.c, vars.z, df, transpose=FALSE) %>%
  kable() %>%
  kable_styling_fc()
esti.val value
(Intercept)_Estimate 52.1186286658655
prot_Estimate 0.37447238635789
sexMale_Estimate 0.611043720578337
hgt0_Estimate 0.148513781160842
wgt0_Estimate 0.00150560230505629
(Intercept)_Std.Error 1.57770483608693
prot_Std.Error 0.004181211911338
sexMale_Std.Error 0.118396259120663
hgt0_Std.Error 0.0393807494783184
wgt0_Std.Error 0.000187123663624396
(Intercept)_tvalue 33.0344608660336
prot_tvalue 89.5607288744324
sexMale_tvalue 5.16100529794268
hgt0_tvalue 3.77122790013451
wgt0_tvalue 8.04602836377986
(Intercept)_Pr(>|t|) 9.92126150965951e-233
prot_Pr(>|t|) 0
sexMale_Pr(>|t|) 2.48105505495376e-07
hgt0_Pr(>|t|) 0.000162939618371172
wgt0_Pr(>|t|) 9.05257561534482e-16
df1_v 5
df2_v 18958
df3_v 5
sigma_v 8.06197784622979
r.squared_v 0.319078711001326
adj.r.squared_v 0.318935041565942
vars_var.y hgt
vars_vars.x prot
vars_vars.c sex+hgt0+wgt0
5.2.1.2.2 Example 1 Instrument
# One instrument
var.y <- c('hgt')
vars.x <- c('prot')
vars.z <- c('momEdu')
vars.c <- c('sex', 'hgt0', 'wgt0')
# Regression
regf.iv(var.y, vars.x, vars.c, vars.z, df, transpose=FALSE) %>%
  kable() %>%
  kable_styling_fc()
esti.val value
(Intercept)_Estimate 43.4301969117894
prot_Estimate 0.130833343849431
sexMale_Estimate 0.868121847262594
hgt0_Estimate 0.412093881816612
wgt0_Estimate 0.000858630042618959
(Intercept)_Std.Error 1.82489550970971
prot_Std.Error 0.0192036220809207
sexMale_Std.Error 0.133730167005418
hgt0_Std.Error 0.0459431912925973
wgt0_Std.Error 0.000226910577025037
(Intercept)_zvalue 23.7987307660689
prot_zvalue 6.81295139521715
sexMale_zvalue 6.49159323361512
hgt0_zvalue 8.96963990141912
wgt0_zvalue 3.78400184723085
(Intercept)_Pr(>|z|) 3.44237661591474e-125
prot_Pr(>|z|) 9.56164541652958e-12
sexMale_Pr(>|z|) 8.49333228164569e-11
hgt0_Pr(>|z|) 2.97485394504032e-19
wgt0_Pr(>|z|) 0.000154326676599558
Weakinstruments_df1 1
Wu-Hausman_df1 1
Sargan_df1 0
Weakinstruments_df2 16394
Wu-Hausman_df2 16393
Weakinstruments_statistic 935.817456612075
Wu-Hausman_statistic 123.595856606734
Weakinstruments_p-value 6.3971492917806e-200
Wu-Hausman_p-value 1.30703637796418e-28
vars_var.y hgt
vars_vars.x prot
vars_vars.z momEdu
vars_vars.c sex+hgt0+wgt0
5.2.1.2.3 Example Multiple Instruments
# Multiple instruments
var.y <- c('hgt')
vars.x <- c('prot')
vars.z <- c('momEdu', 'wealthIdx', 'p.A.prot', 'p.A.nProt')
vars.c <- c('sex', 'hgt0', 'wgt0')
# Regression
regf.iv(var.y, vars.x, vars.c, vars.z, df, transpose=FALSE) %>%
  kable() %>%
  kable_styling_fc()
esti.val value
(Intercept)_Estimate 42.243761355532
prot_Estimate 0.266999451947032
sexMale_Estimate 0.695548488813011
hgt0_Estimate 0.424954881262903
wgt0_Estimate 0.00048695142032974
(Intercept)_Std.Error 1.8535668678938
prot_Std.Error 0.0154939347964086
sexMale_Std.Error 0.133157977814372
hgt0_Std.Error 0.0463195803786077
wgt0_Std.Error 0.000224867994873511
(Intercept)_zvalue 22.7905246297013
prot_zvalue 17.2325142357589
sexMale_zvalue 5.22348341593647
hgt0_zvalue 9.17441129192881
wgt0_zvalue 2.16549901022443
(Intercept)_Pr(>|z|) 5.6929407426237e-115
prot_Pr(>|z|) 1.51424021933765e-66
sexMale_Pr(>|z|) 1.75588197501936e-07
hgt0_Pr(>|z|) 4.54048595586446e-20
wgt0_Pr(>|z|) 0.0303494911144483
Weakinstruments_df1 4
Wu-Hausman_df1 1
Sargan_df1 3
Weakinstruments_df2 14914
Wu-Hausman_df2 14916
Weakinstruments_statistic 274.147084958342
Wu-Hausman_statistic 17.7562545747099
Sargan_statistic 463.729664547247
Weakinstruments_p-value 8.6173195623464e-228
Wu-Hausman_p-value 2.52567249124201e-05
Sargan_p-value 3.45452874915773e-100
vars_var.y hgt
vars_vars.x prot
vars_vars.z momEdu+wealthIdx+p.A.prot+p.A.nProt
vars_vars.c sex+hgt0+wgt0
5.2.1.2.4 Example Multiple Endogenous Variables
# Multiple endogenous variables, multiple instruments
var.y <- c('hgt')
vars.x <- c('prot', 'cal')
vars.z <- c('momEdu', 'wealthIdx', 'p.A.prot', 'p.A.nProt')
vars.c <- c('sex', 'hgt0', 'wgt0')
# Regression
regf.iv(var.y, vars.x, vars.c, vars.z, df, transpose=FALSE) %>%
  kable() %>%
  kable_styling_fc()
esti.val value
(Intercept)_Estimate 44.0243196254372
prot_Estimate -1.40256232471058
cal_Estimate 0.0651048957501503
sexMale_Estimate 0.120832787571903
hgt0_Estimate 0.286525437984394
wgt0_Estimate 0.000850481389651284
(Intercept)_Std.Error 2.75354847245259
prot_Std.Error 0.198640060272442
cal_Std.Error 0.00758881298876464
sexMale_Std.Error 0.209984580636194
hgt0_Std.Error 0.0707828182891216
wgt0_Std.Error 0.000337112104445047
(Intercept)_zvalue 15.9882130515846
prot_zvalue -7.06082309271814
cal_zvalue 8.57906181724851
sexMale_zvalue 0.575436478268133
hgt0_zvalue 4.04795181810993
wgt0_zvalue 2.52284441417891
(Intercept)_Pr(>|z|) 1.54396598289425e-57
prot_Pr(>|z|) 1.65519210798224e-12
cal_Pr(>|z|) 9.56500647777971e-18
sexMale_Pr(>|z|) 0.564996139463126
hgt0_Pr(>|z|) 5.16677787150118e-05
wgt0_Pr(>|z|) 0.011640989283946
Weakinstruments(prot)_df1 4
Weakinstruments(cal)_df1 4
Wu-Hausman_df1 2
Sargan_df1 2
Weakinstruments(prot)_df2 14914
Weakinstruments(cal)_df2 14914
Wu-Hausman_df2 14914
Weakinstruments(prot)_statistic 274.147084958342
Weakinstruments(cal)_statistic 315.036848606229
Wu-Hausman_statistic 94.7020085425631
Sargan_statistic 122.0819796289
Weakinstruments(prot)_p-value 8.6173195623464e-228
Weakinstruments(cal)_p-value 1.18918641221312e-260
Wu-Hausman_p-value 1.35024050402095e-41
Sargan_p-value 3.09196773720141e-27
vars_var.y hgt
vars_vars.x prot+cal
vars_vars.z momEdu+wealthIdx+p.A.prot+p.A.nProt
vars_vars.c sex+hgt0+wgt0
5.2.1.2.5 Examples Line by Line

The examples are just to test the code with different types of variables.

# Selecting Variables
var.y <- c('hgt')
vars.x <- c('prot', 'cal')
vars.z <- c('momEdu', 'wealthIdx', 'p.A.prot', 'p.A.nProt')
vars.c <- c('sex', 'hgt0', 'wgt0')
# A. create Equation
str.vars.x <- paste(vars.x, collapse='+')
str.vars.c <- paste(vars.c, collapse='+')
str.vars.z <- paste(vars.z, collapse='+')
print(str.vars.x)
## [1] "prot+cal"
print(str.vars.c)
## [1] "sex+hgt0+wgt0"
print(str.vars.z)
## [1] "momEdu+wealthIdx+p.A.prot+p.A.nProt"
equa.iv <- paste(var.y,
                 paste(paste(str.vars.x, str.vars.c, sep='+'),
                       paste(str.vars.z, str.vars.c, sep='+'),
                       sep='|'),
                 sep='~')
print(equa.iv)
## [1] "hgt~prot+cal+sex+hgt0+wgt0|momEdu+wealthIdx+p.A.prot+p.A.nProt+sex+hgt0+wgt0"
# B. regression
res.ivreg <- ivreg(as.formula(equa.iv), data=df)
coef(res.ivreg)
##   (Intercept)          prot           cal       sexMale          hgt0          wgt0 
## 44.0243196254 -1.4025623247  0.0651048958  0.1208327876  0.2865254380  0.0008504814
# C. Regression Summary
ivreg.summ <- summary(res.ivreg, vcov = sandwich, df = Inf, diagnostics = TRUE)

ivreg.summ$coef
##                  Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept) 44.0243196254 2.7535484725 15.9882131 1.543966e-57
## prot        -1.4025623247 0.1986400603 -7.0608231 1.655192e-12
## cal          0.0651048958 0.0075888130  8.5790618 9.565006e-18
## sexMale      0.1208327876 0.2099845806  0.5754365 5.649961e-01
## hgt0         0.2865254380 0.0707828183  4.0479518 5.166778e-05
## wgt0         0.0008504814 0.0003371121  2.5228444 1.164099e-02
## attr(,"df")
## [1] 0
## attr(,"nobs")
## [1] 14922
ivreg.summ$diagnostics
##                         df1   df2 statistic       p-value
## Weak instruments (prot)   4 14914 274.14708 8.617320e-228
## Weak instruments (cal)    4 14914 315.03685 1.189186e-260
## Wu-Hausman                2 14914  94.70201  1.350241e-41
## Sargan                    2    NA 122.08198  3.091968e-27
# D. Combine Regression Results into a Matrix
df.results <- suppressMessages(as_tibble(ivreg.summ$coef, rownames='rownames') %>%
    full_join(as_tibble(ivreg.summ$diagnostics, rownames='rownames')) %>%
    full_join(tibble(rownames=c('vars'),
                     var.y=var.y,
                     vars.x=str.vars.x,
                     vars.z=str.vars.z,
                     vars.c=str.vars.c)))
# E. Flatten Matrix, All IV results as a single tibble row to be combined with other IV results
df.row.results <- df.results %>%
    gather(variable, value, -rownames) %>%
    drop_na() %>%
    unite(esti.val, rownames, variable) %>%
    mutate(esti.val = gsub(' ', '', esti.val))
# F. Results as Single Column
# df.row.results
# G. Results as Single Row
# df.row.results
# t(df.row.results %>% spread(esti.val, value)) %>%
#   kable() %>%
#   kable_styling_fc_wide()

5.2.2 IV Loop over RHS


Regression with a variety of outcome variables and right hand side variables. There are M outcome variables and N alternative right hand side variables. We regress each of the M outcome variables on each of the N alternative right hand side variables, with a common set of controls and possibly shared instruments. The output is an M by N matrix of coefficients, with proper variable names and row names, storing the coefficient on the key endogenous right hand side variable for each combination.

  • Dependency: R4Econ/linreg/ivreg/ivregdfrow.R

5.2.2.1 Construct Program

The program relies on double lapply. lapply is used for convenience, not speed.

ff_reg_mbyn <- function(list.vars.y, list.vars.x,
                        vars.c, vars.z, df,
                        return_all = FALSE,
                        stats_ends = 'value', time = FALSE) {

  # regf.iv() function is from C:\Users\fan\R4Econ\linreg\ivreg\ivregdfrow.R
  if (time) {
    start_time <- Sys.time()
  }

  if (return_all) {
    df.reg.out.all <-
      bind_rows(lapply(list.vars.x,
                       function(x) (
                         bind_rows(
                           lapply(list.vars.y, regf.iv,
                                  vars.x=x, vars.c=vars.c, vars.z=vars.z, df=df))
                       )))

  } else {
    df.reg.out.all <-
      (lapply(list.vars.x,
              function(x) (
                bind_rows(
                  lapply(list.vars.y, regf.iv,
                         vars.x=x, vars.c=vars.c, vars.z=vars.z, df=df)) %>%
                  select(vars_var.y, starts_with(x)) %>%
                  select(vars_var.y, ends_with(stats_ends))
              ))) %>% reduce(full_join)
  }

  if (time) {
    end_time <- Sys.time()
    print(paste0('Estimation for all ys and xs took (seconds):',
                 end_time - start_time))
  }

  return(df.reg.out.all)
}

5.2.2.2 Prepare Data

# Library
library(tidyverse)
library(AER)

# Load Sample Data
setwd('C:/Users/fan/R4Econ/_data/')
df <- read_csv('height_weight.csv')

# Source Dependency
source('C:/Users/fan/R4Econ/linreg/ivreg/ivregdfrow.R')

# Setting
options(repr.matrix.max.rows=50, repr.matrix.max.cols=50)

Parameters.

var.y1 <- c('hgt')
var.y2 <- c('wgt')
var.y3 <- c('vil.id')
list.vars.y <- c(var.y1, var.y2, var.y3)

var.x1 <- c('prot')
var.x2 <- c('cal')
var.x3 <- c('wealthIdx')
var.x4 <- c('p.A.prot')
var.x5 <- c('p.A.nProt')
list.vars.x <- c(var.x1, var.x2, var.x3, var.x4, var.x5)

vars.z <- c('indi.id')
vars.c <- c('sex', 'wgt0', 'hgt0', 'svymthRound')

5.2.2.3 Program Testing

5.2.2.3.1 Test Program OLS T-Stat
vars.z <- NULL
suppressWarnings(suppressMessages(
  ff_reg_mbyn(list.vars.y, list.vars.x,
              vars.c, vars.z, df,
              return_all = FALSE,
              stats_ends = 'value'))) %>%
  kable() %>%
  kable_styling_fc_wide()
vars_var.y prot_tvalue cal_tvalue wealthIdx_tvalue p.A.prot_tvalue p.A.nProt_tvalue
hgt 18.8756010031766 23.4421863484656 13.5088996182178 3.83682180045522 32.5448257554849
wgt 16.359112505607 17.368603130933 14.1390521528125 1.36958319982295 12.0961557911473
vil.id -14.9385580468905 -19.6150110809449 34.0972558327354 8.45943342783161 17.7801422421398
5.2.2.3.2 Test Program IV Z-Stat
vars.z <- c('indi.id')
suppressWarnings(suppressMessages(
  ff_reg_mbyn(list.vars.y, list.vars.x,
              vars.c, vars.z, df,
              return_all = FALSE,
              stats_ends = 'value'))) %>%
  kable() %>%
  kable_styling_fc_wide()
vars_var.y prot_zvalue cal_zvalue wealthIdx_zvalue p.A.prot_zvalue p.A.nProt_zvalue
hgt 8.87674929306359 12.0739764946734 4.62589553677888 26.6373587567245 32.1162192385715
wgt 5.60385871757577 6.12251870088371 5.17869536991513 11.9295584469951 12.3509307017258
vil.id -9.22106223346427 -13.0586007975956 -51.5866689219473 -29.9627476577358 -38.3528894620791
5.2.2.3.3 Test Program OLS Coefficient
vars.z <- NULL
suppressWarnings(suppressMessages(
  ff_reg_mbyn(list.vars.y, list.vars.x,
              vars.c, vars.z, df,
              return_all = FALSE,
              stats_ends = 'Estimate'))) %>%
  kable() %>%
  kable_styling_fc_wide()
vars_var.y prot_Estimate cal_Estimate wealthIdx_Estimate p.A.prot_Estimate p.A.nProt_Estimate
hgt 0.0494310938067476 0.00243408846205617 0.210456554881893 3.86952250259533e-05 0.00542428867316432
wgt 16.5557424523601 0.69907250036464 106.678721085982 0.00521731297924599 0.77951423205071
vil.id -0.0758835879205561 -0.00395676177098467 0.451733304543376 0.000149388430455129 0.00526237555580908
5.2.2.3.4 Test Program IV coefficient
vars.z <- c('indi.id')
suppressWarnings(suppressMessages(
  ff_reg_mbyn(list.vars.y, list.vars.x,
              vars.c, vars.z, df,
              return_all = FALSE,
              stats_ends = 'Estimate'))) %>%
  kable() %>%
  kable_styling_fc_wide()
vars_var.y prot_Estimate cal_Estimate wealthIdx_Estimate p.A.prot_Estimate p.A.nProt_Estimate
hgt 0.859205733631884 0.0238724384575233 0.144503490136916 0.00148073028434634 0.0141317656200728
wgt 98.9428234201429 2.71948246216963 69.1816142882735 0.221916473012471 2.1185694049434
vil.id -6.0245137913613 -0.168054407187466 -1.91414470908346 -0.00520794333267228 -0.0494468877742103
5.2.2.3.5 Test Program OLS Return All
vars.z <- NULL
t(suppressWarnings(suppressMessages(
  ff_reg_mbyn(list.vars.y, list.vars.x,
            vars.c, vars.z, df,
            return_all = TRUE,
            stats_ends = 'Estimate')))) %>%
  kable() %>%
  kable_styling_fc_wide()
X.Intercept._Estimate 27.3528514188611 99.8738847289486 31.4646660224049 27.9038445914726 219.626705179383 30.510398789855 35.7840188807902 -2662.74787734015 29.2381039651118 23.9948407749741 -547.959546430028 22.3367814226239 24.4904444950826 -476.703973630535 22.7781908464511
X.Intercept._Pr…t.. 5.68247182207379e-231 0.755297055538094 6.78164655341251e-84 8.24252674004467e-242 0.493216914827213 1.62608789535448e-79 2.26726906490865e-145 7.13318862989451e-05 1.53578035270423e-124 2.11912344055752e-165 0.0941551350855863 3.04337266226398e-49 2.34941965808017e-181 0.143844033032196 9.58029450711073e-52
X.Intercept._Std.Error 0.831272666092281 320.450650378664 1.61328519718754 0.828072565159453 320.522532223671 1.60831193651105 1.38461348429899 670.301542938553 1.22602177264147 0.866581042166723 327.343126852911 1.5098937308759 0.843371070670839 326.132837036935 1.50045265589571
X.Intercept._tvalue 32.904788686778 0.311666974652514 19.5034740771549 33.6973421962113 0.685214557790027 18.9704485163755 25.8440491058103 -3.9724627003943 23.8479483950095 27.6890903532572 -1.6739607509042 14.7936116071336 29.0387533397396 -1.46168652614562 15.1808794212527
adj.r.squared_v 0.814249026159781 0.60716936506893 0.037324751268097 0.81608922805658 0.607863678511207 0.045349871107604 0.935014931990565 0.92193683733695 0.0595431228127756 0.814690803458616 0.617300597776144 0.026113107419984 0.824542352656376 0.620250730454724 0.0385437355117912
df1_v 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
df2_v 18957 18962 18999 18957 18962 18999 25092 25102 30013 18587 18591 18845 18587 18591 18845
df3_v 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
hgt0_Estimate 0.603918173406165 56.3852027199177 -0.296844389234445 0.589847843438402 52.9707041800702 -0.273219210757899 0.439374451256045 47.1769696647486 -0.359081639820446 0.687269209411872 72.1055606233589 -0.108789161111504 0.622395388389211 62.7336220289252 -0.157811627494689
hgt0_Pr…t.. 1.14533314567228e-183 1.52417506966924e-12 1.40290395213742e-13 7.79174951113908e-177 3.05720143843438e-11 8.49149153665201e-12 2.71000479248473e-36 0.00520266507060087 2.41020063625353e-31 1.31914432912238e-220 4.78613024243947e-19 0.0034801146146179 1.11511327164595e-190 8.38546282719406e-15 2.13723119924767e-05
hgt0_Std.Error 0.0206657538633712 7.96735224000553 0.0401060913799595 0.0205836398278422 7.96822145797113 0.0399777363511634 0.0348701896610764 16.8823489375742 0.0307984635553862 0.0213841849324283 8.0774490640068 0.0372288594891345 0.0208846437570216 8.07589192978209 0.0371223237183417
hgt0_tvalue 29.2231378249682 7.07703149319762 -7.40147890309685 28.6561486875879 6.64774497790597 -6.83428417151857 12.6002885423504 2.79445531182863 -11.6590764073245 32.1391351404586 8.92677379355594 -2.92217281443326 29.8015803204667 7.76801157994421 -4.25112470577148
prot_Estimate 0.0494310938067476 16.5557424523601 -0.0758835879205561 NA NA NA NA NA NA NA NA NA NA NA NA
prot_Pr…t.. 9.54769322340913e-79 9.61203373208916e-60 3.56396093563316e-50 NA NA NA NA NA NA NA NA NA NA NA NA
prot_Std.Error 0.00261878251179546 1.01201959743756 0.00507971302734612 NA NA NA NA NA NA NA NA NA NA NA NA
prot_tvalue 18.8756010031766 16.359112505607 -14.9385580468905 NA NA NA NA NA NA NA NA NA NA NA NA
r.squared_v 0.814298005954592 0.607272921412825 0.0375780335372857 0.816137722617266 0.60796705182314 0.0456010419476621 0.93502787877066 0.921952383432195 0.0596997716363459 0.814740639193486 0.617403496088207 0.0263714328556816 0.824589538985803 0.620352835549783 0.0387987636986582
sexMale_Estimate 0.935177182449465 415.163616765342 -0.254089999175327 0.893484662055601 405.534891838043 -0.18138948961095 1.80682463132053 999.926876716704 -0.334367777515325 0.932686930233099 397.14194867534 -0.445232370682025 0.964669805007069 401.590563681033 -0.423829627017541
sexMale_Pr…t.. 2.36432111723075e-51 2.48252880292341e-67 0.0343768259467531 2.0876593533594e-47 2.51355675685191e-64 0.129768754080749 1.26527362034417e-66 2.64630894140163e-86 0.000311174554786564 7.9048902058919e-47 6.1944974268004e-59 7.93666802281656e-05 1.24556615237397e-52 1.18469030740813e-60 0.000156446936361652
sexMale_Std.Error 0.0618482294097284 23.8518341439671 0.120093045309629 0.0616078355613521 23.856750758352 0.11972270545355 0.104475287357897 50.5879876531385 0.0927193334338761 0.0647209948973254 24.4473730956476 0.112797805327956 0.0629827627260292 24.3549086073391 0.112083516545939
sexMale_tvalue 15.1205166481672 17.4059409544549 -2.11577613441495 14.5027763743757 16.9987478993161 -1.51508010885476 17.2942776901007 19.7660931597596 -3.60623577771709 14.4108867873977 16.2447698213451 -3.94717228218692 15.3164098120516 16.4891016491032 -3.78137339083065
sigma_v 4.21029844914315 1623.77111076428 8.18491760066961 4.18939119979502 1622.33549880859 8.15073036560541 8.18607049768594 3964.45339913597 7.93450742809862 4.35662621773428 1645.77655955938 7.6435668370875 4.23923961592693 1639.42085007515 7.59462918474114
svymthRound_Estimate 0.871665891005652 189.042906883819 -0.015475958799392 0.851989049736821 185.318286001897 0.0201471237605437 0.432815253441721 189.87799479506 0.00215144302579708 0.919614676961391 205.597385664746 -0.0509574460702801 0.921894094780682 205.945143306004 -0.0557204455206458
svymthRound_Pr…t.. 0 0 0.0397984032097092 0 0 0.0117151185126457 0 0 0.000447277200167213 0 0 1.3713938980249e-18 0 0 7.79141497752014e-23
svymthRound_Std.Error 0.00387681209575622 1.4955473831309 0.00752730297891319 0.00411253488213796 1.59266949679222 0.00799217807522279 0.000728323735328996 0.352701518968251 0.000612792699568233 0.00331108017589108 1.25083486490652 0.00578476859618168 0.00317113547025635 1.22639878616071 0.00565696328562864
svymthRound_tvalue 224.840892330022 126.403823119306 -2.05597660181156 207.168832400007 116.357025971267 2.52085521254881 594.262183761196 538.353209678557 3.51088227277016 277.738571133786 164.368128386085 -8.8088996513906 290.714194782148 167.926734460268 -9.84988636256525
vars_var.y hgt wgt vil.id hgt wgt vil.id hgt wgt vil.id hgt wgt vil.id hgt wgt vil.id
vars_vars.c sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound
vars_vars.x prot prot prot cal cal cal wealthIdx wealthIdx wealthIdx p.A.prot p.A.prot p.A.prot p.A.nProt p.A.nProt p.A.nProt
wgt0_Estimate -0.000146104685986974 0.637023553461056 -0.00090339059153386 -0.000116898230009977 0.64939400361476 -0.000941137072743922 0.00122231975126217 1.32870822160237 -0.000845938526704835 -0.000489534836079631 0.580023505722661 -0.0015619691115606 3.23596154259002e-05 0.655512063046742 -0.00115432723977404
wgt0_Pr…t.. 0.13601158349758 2.96480083692537e-63 2.05763549729284e-06 0.230228828648909 7.43034302413299e-66 6.66901196231758e-07 1.22269348058886e-13 6.75367630220942e-62 4.32675510884396e-09 7.77000489086255e-07 7.42419220782874e-54 1.40362012201848e-19 0.740027016459631 4.09082062948253e-67 2.7547278172825e-11
wgt0_Std.Error 9.79994437486569e-05 0.0378027371614794 0.00019022150316743 9.74307633896925e-05 0.037739875283113 0.000189270503626622 0.000164767846917987 0.0798131859486413 0.000144040382619523 9.90410500454322e-05 0.0374185042114355 0.000172365145002826 9.75208524392676e-05 0.0377202854835201 0.000173241059789274
wgt0_tvalue -1.49087260496799 16.851254731633 -4.7491507347553 -1.19980821193426 17.2071051836606 -4.97244448929307 7.41843614592217 16.6477281392748 -5.87292612891309 -4.94274682926999 15.5009805428139 -9.06197776548728 0.33182252427554 17.3782370584956 -6.66312732777168
cal_Estimate NA NA NA 0.00243408846205617 0.69907250036464 -0.00395676177098467 NA NA NA NA NA NA NA NA NA
cal_Pr…t.. NA NA NA 8.01672708887592e-120 4.71331900886782e-67 7.94646124034726e-85 NA NA NA NA NA NA NA NA NA
cal_Std.Error NA NA NA 0.000103833679413418 0.0402492068645181 0.000201721108117471 NA NA NA NA NA NA NA NA NA
cal_tvalue NA NA NA 23.4421863484656 17.368603130933 -19.6150110809449 NA NA NA NA NA NA NA NA NA
wealthIdx_Estimate NA NA NA NA NA NA 0.210456554881893 106.678721085982 0.451733304543376 NA NA NA NA NA NA
wealthIdx_Pr…t.. NA NA NA NA NA NA 1.93494257269637e-41 3.25483455344948e-45 4.82890644809744e-250 NA NA NA NA NA NA
wealthIdx_Std.Error NA NA NA NA NA NA 0.0155791042075756 7.54496977117115 0.0132483771350797 NA NA NA NA NA NA
wealthIdx_tvalue NA NA NA NA NA NA 13.5088996182178 14.1390521528125 34.0972558327354 NA NA NA NA NA NA
p.A.prot_Estimate NA NA NA NA NA NA NA NA NA 3.86952250259533e-05 0.00521731297924599 0.000149388430455129 NA NA NA
p.A.prot_Pr…t.. NA NA NA NA NA NA NA NA NA 0.000125048896903774 0.170833589209346 2.88060045452304e-17 NA NA NA
p.A.prot_Std.Error NA NA NA NA NA NA NA NA NA 1.00852286184785e-05 0.00380941660201473 1.76593895713677e-05 NA NA NA
p.A.prot_tvalue NA NA NA NA NA NA NA NA NA 3.83682180045522 1.36958319982295 8.45943342783161 NA NA NA
p.A.nProt_Estimate NA NA NA NA NA NA NA NA NA NA NA NA 0.00542428867316432 0.77951423205071 0.00526237555580908
p.A.nProt_Pr…t.. NA NA NA NA NA NA NA NA NA NA NA NA 5.25341325087684e-226 1.4795093994274e-33 3.76857802825063e-70
p.A.nProt_Std.Error NA NA NA NA NA NA NA NA NA NA NA NA 0.000166671307872962 0.0644431375975831 0.000295969260770984
p.A.nProt_tvalue NA NA NA NA NA NA NA NA NA NA NA NA 32.5448257554849 12.0961557911473 17.7801422421398
5.2.2.3.6 Test Program IV Return All
vars.z <- c('indi.id')
t(suppressWarnings(suppressMessages(
  ff_reg_mbyn(list.vars.y, list.vars.x,
            vars.c, vars.z, df,
            return_all = TRUE,
            stats_ends = 'Estimate')))) %>%
  kable() %>%
  kable_styling_fc_wide()
X.Intercept._Estimate 40.2173991882485 1408.16266370338 -64.4906360678745 39.6732302989797 1325.54736576349 -59.8304089440739 35.5561817357076 -2791.22153490887 21.8005242861703 24.300926170769 -499.067024090765 21.4632286881799 25.2992097396208 -352.278518335053 17.9359211845123
X.Intercept._Pr…z.. 3.69748205151063e-59 0.00217397545492022 0.000109756271652438 1.300302408639e-103 0.00138952700446946 3.75547414435365e-07 2.01357089607565e-142 1.95034793045609e-05 1.17899313915906e-34 1.979686075101e-84 0.155922992161587 1.84405333725821e-09 1.2938856648278e-157 0.287184942020645 1.13855583545194e-12
X.Intercept._Std.Error 2.47963650425926 459.377029871511 16.6730992506847 1.83545587850906 414.64590052724 11.7754321199163 1.39936229104615 653.605248808668 1.77547715238971 1.24813311286047 351.723712331823 3.57067054655076 0.945826571483205 330.990098562007 2.52170174724026
X.Intercept._zvalue 16.2190704642262 3.06537456628437 -3.86794531108104 21.6149190855006 3.1968177282785 -5.08095230262336 25.4088465604759 -4.27050048939548 12.2786847788088 19.4698193008246 -1.41891776582847 6.01097984492387 26.7482543865813 -1.06431739156408 7.11262590992028
hgt0_Estimate 0.403139725682047 35.5765914326644 1.20995060148709 0.357976348181521 31.017270649736 1.50374470896815 0.460434521499931 59.1545587745054 0.412512139030997 0.515794899568958 46.2591615803456 0.520812513246497 0.510868687340372 45.5654716961748 0.534362107843996
hgt0_Pr…z.. 1.25009876505916e-13 0.00044580263634977 0.000971126494011876 2.8214126518022e-17 0.00131003033160538 3.70002169493517e-08 2.98739737306075e-37 0.000542570320025762 3.02226358149701e-20 8.57492957072496e-59 2.85614887356539e-07 1.10039023739786e-08 3.24936431571418e-102 6.34545453001148e-08 3.42500501257647e-17
hgt0_Std.Error 0.054394831296424 10.1318250571452 0.366789440586502 0.0423453726228281 9.65135595902108 0.273179527952845 0.0361031059207924 17.1025823111652 0.0447499166719814 0.0319035514862777 9.01263684091064 0.0911390672918101 0.0237991645880177 8.4243486539678 0.0633800587736384
hgt0_zvalue 7.41136089723565 3.51137048182399 3.29876072646029 8.45373003019788 3.21377335800617 5.50460248700526 12.7533216258482 3.45880859967487 9.2181655231836 16.1673191710588 5.13270005181653 5.7144814921021 21.4658243759356 5.40878275197143 8.43107624359371
prot_Estimate 0.859205733631884 98.9428234201429 -6.0245137913613 NA NA NA NA NA NA NA NA NA NA NA NA
prot_Pr…z.. 6.88427337868624e-19 2.09631602338248e-08 2.94171378765984e-20 NA NA NA NA NA NA NA NA NA NA NA NA
prot_Std.Error 0.0967928354474626 17.656195205247 0.653342710289674 NA NA NA NA NA NA NA NA NA NA NA NA
prot_zvalue 8.87674929306359 5.60385871757577 -9.22106223346427 NA NA NA NA NA NA NA NA NA NA NA NA
Sargan_df1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sexMale_Estimate 0.154043421788434 333.799680049256 5.41175429817601 0.106307556058088 330.452608866756 5.831189427888 1.80283907885784 997.747599807137 -0.452827875182574 1.02741625216021 411.365911332902 -0.789122421167278 1.02009164592613 409.820707458849 -0.746032636368026
sexMale_Pr…z.. 0.388078129326844 5.06413216619956e-24 5.80077629935398e-06 0.423490075743443 2.52735690942035e-27 6.12283824653399e-12 1.16893284800756e-65 2.02347084790526e-89 0.000647195788039458 1.69796551008425e-27 2.05327249428077e-54 0.00428270841485319 1.70848440094532e-51 2.36314216739175e-62 6.57521045475854e-05
sexMale_Std.Error 0.178475271469509 33.0216035385256 1.19371921154444 0.132821186086593 30.5174257712039 0.847955715223007 0.105343525210948 49.7632792630674 0.132754263303729 0.0945646985181949 26.482231353221 0.276250047248342 0.0675715533063684 24.5920104216274 0.186921458372094
sexMale_zvalue 0.863107928174524 10.1085242471589 4.5335236677428 0.800381017443866 10.8283251459096 6.87676174970345 17.1139049623383 20.049876426605 -3.41102322376305 10.8646912458832 15.5336574870179 -2.85655126226232 15.0964658352761 16.6647907361991 -3.99115565898775
svymthRound_Estimate 0.209901650858444 121.789859431719 4.84745570027421 0.32289383712906 135.494858749213 4.07024693316579 0.433164820953122 190.077351395411 0.013743826466697 1.00582859923507 218.549980922773 -0.369567838754919 0.929266902426876 207.078222946319 -0.0985678389223871
svymthRound_Pr…z.. 0.00846239710330183 5.96047652726604e-17 2.07373887990734e-19 9.66146445970411e-11 4.48931446156578e-34 5.64723572075311e-36 0 0 1.57416908709045e-66 0 0 2.42696379710431e-102 0 0 1.8456989795407e-27
svymthRound_Std.Error 0.0797183179466224 14.5577085129173 0.538050140686242 0.0498896912189897 11.133488331491 0.325043349284404 0.00120472816008757 0.739269879490012 0.000797655931686455 0.00746867714609322 1.93157117819081 0.0172056989832521 0.00539330635998891 1.46167854745869 0.00907867488118111
svymthRound_zvalue 2.63304164293819 8.3660048093182 9.00930105527276 6.47215545415433 12.1700274626388 12.5221664806451 359.553993426731 257.114967982378 17.2302692435809 134.672925279842 113.146221785871 -21.4793853545069 172.300040161039 141.671520941694 -10.857073329799
vars_var.y hgt wgt vil.id hgt wgt vil.id hgt wgt vil.id hgt wgt vil.id hgt wgt vil.id
vars_vars.c sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound sex+wgt0+hgt0+svymthRound
vars_vars.x prot prot prot cal cal cal wealthIdx wealthIdx wealthIdx p.A.prot p.A.prot p.A.prot p.A.nProt p.A.nProt p.A.nProt
vars_vars.z indi.id indi.id indi.id indi.id indi.id indi.id indi.id indi.id indi.id indi.id indi.id indi.id indi.id indi.id indi.id
Weakinstruments_df1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Weakinstruments_df2 18957 18962 18999 18957 18962 18999 25092 25102 30013 18587 18591 18845 18587 18591 18845
Weakinstruments_p.value 1.42153759924051e-19 4.45734829676535e-19 5.72345606958224e-20 1.77770827184557e-37 4.03760292921249e-37 5.4744773509349e-38 0 0 0 0 0 0 0 0 0
Weakinstruments_statistic 82.0931934821259 79.8251182827394 83.8989817367576 164.392129625297 162.747072038427 166.752606654978 7029.47383089423 7038.38467113138 12942.6315513376 1710.98122418593 1715.15052113401 1725.71954882909 5097.88462603714 5110.7741807339 5136.55662964893
wgt0_Estimate -0.00163274724538053 0.492582112313714 0.00999798623641609 -0.000658938519303158 0.601258436431589 0.00326074237566447 0.00112485055604174 1.27282038539716 -0.00512158791392231 0.000716628918445026 0.761704518610409 -0.00601345031606022 0.00092210011725941 0.792700893714032 -0.00668277875606412
wgt0_Pr…z.. 4.88365163658792e-08 2.3313655523177e-20 7.95432753715951e-07 0.000328431498069385 2.09211347307227e-48 0.00667886646010296 2.26123807436414e-11 6.67525280115101e-56 6.51923754467955e-127 2.43477572087317e-06 8.22014792868139e-69 5.19751747054247e-44 1.68237436783295e-15 4.81415543482617e-82 2.5484884022112e-105
wgt0_Std.Error 0.000299284876595226 0.0532753838702927 0.00202532507408109 0.000183457551985467 0.0411255751282266 0.0012021409416413 0.000168187467853392 0.0808047514011814 0.000213715312589844 0.000152036990659248 0.0434474820358988 0.000432182413699013 0.00011580150512101 0.0413159097814226 0.000306609919183132
wgt0_zvalue -5.45549532590907 9.24596082710511 4.93648469787117 -3.59177647456756 14.6200614716489 2.71244598924693 6.68807593335234 15.7518012657181 -23.964534182684 4.71351685755978 17.5316147891159 -13.9141485758099 7.96276452793799 19.186335189222 -21.7957030674948
Wu.Hausman_df1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Wu.Hausman_df2 18956 18961 18998 18956 18961 18998 25091 25101 30012 18586 18590 18844 18586 18590 18844
Wu.Hausman_p.value 1.53929570350174e-118 3.1341589140336e-08 0 2.88592506947287e-108 7.64959440843259e-07 0 0.0221987672063031 0.00993600230368616 0 1.80909125272665e-238 2.14946499922086e-35 0 3.15182965429387e-108 1.76811257415626e-17 0
Wu.Hausman_statistic 543.467268879861 30.6481856102737 5652.51924793073 494.955883488803 24.4605456761215 5583.56513052695 5.23078768861662 6.64734699528169 25949.7118056005 1119.87022468742 154.793296861585 4826.92242730011 494.903094649185 72.5307870103482 7607.83405438101
cal_Estimate NA NA NA 0.0238724384575233 2.71948246216963 -0.168054407187466 NA NA NA NA NA NA NA NA NA
cal_Pr…z.. NA NA NA 1.4495661654094e-33 9.21076021353428e-10 5.6761450167746e-39 NA NA NA NA NA NA NA NA NA
cal_Std.Error NA NA NA 0.00197718112736554 0.444177077283097 0.0128692506794762 NA NA NA NA NA NA NA NA NA
cal_zvalue NA NA NA 12.0739764946734 6.12251870088371 -13.0586007975956 NA NA NA NA NA NA NA NA NA
wealthIdx_Estimate NA NA NA NA NA NA 0.144503490136916 69.1816142882735 -1.91414470908346 NA NA NA NA NA NA
wealthIdx_Pr…z.. NA NA NA NA NA NA 3.72983264927879e-06 2.23442991283626e-07 0 NA NA NA NA NA NA
wealthIdx_Std.Error NA NA NA NA NA NA 0.0312379492766362 13.3588885513857 0.0371054140359332 NA NA NA NA NA NA
wealthIdx_zvalue NA NA NA NA NA NA 4.62589553677888 5.17869536991513 -51.5866689219473 NA NA NA NA NA NA
p.A.prot_Estimate NA NA NA NA NA NA NA NA NA 0.00148073028434634 0.221916473012471 -0.00520794333267228 NA NA NA
p.A.prot_Pr…z.. NA NA NA NA NA NA NA NA NA 2.50759287111904e-156 8.30126393445594e-33 3.00201193980421e-197 NA NA NA
p.A.prot_Std.Error NA NA NA NA NA NA NA NA NA 5.55884799941939e-05 0.0186022369560852 0.000173813943639702 NA NA NA
p.A.prot_zvalue NA NA NA NA NA NA NA NA NA 26.6373587567245 11.9295584469951 -29.9627476577358 NA NA NA
p.A.nProt_Estimate NA NA NA NA NA NA NA NA NA NA NA NA 0.0141317656200728 2.1185694049434 -0.0494468877742103
p.A.nProt_Pr…z.. NA NA NA NA NA NA NA NA NA NA NA NA 2.61782083799059e-226 4.81511329045864e-35 0
p.A.nProt_Std.Error NA NA NA NA NA NA NA NA NA NA NA NA 0.000440019589949139 0.17153115470459 0.00128926108222172
p.A.nProt_zvalue NA NA NA NA NA NA NA NA NA NA NA NA 32.1162192385715 12.3509307017258 -38.3528894620791

5.2.2.4 Program Line by Line

Set Up Parameters

vars.z <- c('indi.id')
vars.z <- NULL
vars.c <- c('sex', 'wgt0', 'hgt0', 'svymthRound')
5.2.2.4.1 Lapply
df.reg.out <- as_tibble(
  bind_rows(lapply(list.vars.y, regf.iv,
                   vars.x=var.x1, vars.c=vars.c, vars.z=vars.z, df=df)))
5.2.2.4.2 Nested Lapply Test
lapply(list.vars.y, function(y) (mean(df[[var.x1]], na.rm=TRUE) +
                                   mean(df[[y]], na.rm=TRUE)))
## [[1]]
## [1] 98.3272
## 
## [[2]]
## [1] 13626.51
## 
## [[3]]
## [1] 26.11226
lapplytwice <- lapply(
  list.vars.x, function(x) (
    lapply(list.vars.y, function(y) (mean(df[[x]], na.rm=TRUE) +
                                       mean(df[[y]], na.rm=TRUE)))))
# lapplytwice
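
The nested call returns a list of lists, one inner list per element of list.vars.x. For a quick look, the nested structure can be flattened into a single numeric vector:

# Flatten the nested list of pairwise mean sums into one numeric vector
unlist(lapplytwice)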
5.2.2.4.3 Nested Lapply All
df.reg.out.all <- bind_rows(
  lapply(list.vars.x,
         function(x) (
           bind_rows(
             lapply(list.vars.y, regf.iv,
                    vars.x=x, vars.c=vars.c, vars.z=vars.z, df=df))
         )))
# df.reg.out.all %>%
#   kable() %>%
#   kable_styling_fc_wide()
5.2.2.4.4 Nested Lapply Select
df.reg.out.all <-
  (lapply(list.vars.x,
          function(x) (
            bind_rows(lapply(list.vars.y, regf.iv,
                             vars.x=x, vars.c=vars.c, vars.z=vars.z, df=df)) %>%
              select(vars_var.y, starts_with(x)) %>%
              select(vars_var.y, ends_with('value'))
          ))) %>% reduce(full_join)
df.reg.out.all %>%
  kable() %>%
  kable_styling_fc_wide()
vars_var.y prot_tvalue cal_tvalue wealthIdx_tvalue p.A.prot_tvalue p.A.nProt_tvalue
hgt 18.8756010031766 23.4421863484656 13.5088996182178 3.83682180045522 32.5448257554849
wgt 16.359112505607 17.368603130933 14.1390521528125 1.36958319982295 12.0961557911473
vil.id -14.9385580468905 -19.6150110809449 34.0972558327354 8.45943342783161 17.7801422421398
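
The reduce(full_join) call at the end of the previous block merges the per-x-variable results on the shared vars_var.y column. A self-contained sketch of that merge pattern, using hypothetical toy data frames:

library(dplyr)
library(purrr)

# Two toy result data frames that share a key column; full_join merges on the
# key, and reduce() applies the join pairwise across the whole list
df_a <- tibble(vars_var.y = c('hgt', 'wgt'), x1_tvalue = c(1.1, 2.2))
df_b <- tibble(vars_var.y = c('hgt', 'wgt'), x2_tvalue = c(3.3, 4.4))
reduce(list(df_a, df_b), full_join, by = 'vars_var.y')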

5.3 Decomposition

5.3.1 Decompose RHS

Go back to fan’s REconTools research support package, R4Econ examples page, PkgTestR packaging guide, or Stat4Econ course page.

We run a number of regressions, with different outcomes and various right-hand-side variables.

The question is: how much variation remains in the left-hand-side variable when the right-hand-side variables are, one by one, set to the averages of their observed values?
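
In equation form, for observation $i$ and a set $K$ of right-hand-side variables moved to their sample means, the adjusted outcome constructed below is

$$\tilde{y}^{K}_{i} = y_{i} + \sum_{k \in K} \hat{\beta}_{k}\left(\bar{x}_{k} - x_{ik}\right),$$

and the remaining variation is $\mathrm{Var}\left(\tilde{y}^{K}\right)$ compared against $\mathrm{Var}\left(y\right)$.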

  • Dependency: R4Econ/linreg/ivreg/ivregdfrow.R

The code below does not handle categorical variables (other than zero/one dummies); categorical inputs need to be converted to zero/one dummy variables first, as sketched below. The examples here simply exercise the code with different types of variables.
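
For instance, a minimal sketch of that conversion, using base R's model.matrix on a hypothetical multi-level factor named region (not part of the data set used below):

# Hypothetical example: expand a multi-level factor into zero/one dummy columns
df_demo <- data.frame(region = factor(c('north', 'south', 'south', 'east')))
# The -1 drops the intercept so that every level gets its own 0/1 column;
# drop one column afterwards if a reference category is needed in the regression
mt_dummies <- model.matrix(~ region - 1, data = df_demo)
df_demo <- cbind(df_demo, mt_dummies)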

# Library
library(tidyverse)
library(AER)

# Load Sample Data
setwd('C:/Users/fan/R4Econ/_data/')
df <- read_csv('height_weight.csv')

# Source Dependency
source('C:/Users/fan/R4Econ/linreg/ivreg/ivregdfrow.R')

Data Cleaning.

# Convert the categorical sex variable into a numeric 0/1 dummy (male)
df$male <- (as.numeric(factor(df$sex)) - 1)
summary(factor(df$sex))
## Female   Male 
##  16446  18619
summary(df$male)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   0.531   1.000   1.000
df.use <- df %>% filter(S.country == 'Guatemala') %>%
  filter(svymthRound %in% c(12, 18, 24))
dim(df.use)
## [1] 2022   16

Setting Up Parameters.

# Define Left Hand Side Variables
var.y1 <- c('hgt')
var.y2 <- c('wgt')
vars.y <- c(var.y1, var.y2)
# Define Right Hand Side Variables
vars.x <- c('prot')
vars.c <- c('male', 'wgt0', 'hgt0', 'svymthRound')
# vars.z <- c('p.A.prot')
vars.z <- c('vil.id')
# vars.z <- NULL
vars.xc <- c(vars.x, vars.c)

# Other variables to keep
vars.other.keep <- c('S.country', 'vil.id', 'indi.id', 'svymthRound')

# Decompose sequence
vars.tomean.first <- c('male', 'hgt0')
var.tomean.first.name.suffix <- '_mh02m'
# 'hgt0' already appears in vars.tomean.first, so it is listed twice in this set
vars.tomean.second <- c(vars.tomean.first, 'hgt0', 'wgt0')
var.tomean.second.name.suffix <- '_mh0me2m'
vars.tomean.third <- c(vars.tomean.second, 'prot')
var.tomean.third.name.suffix <- '_mh0mep2m'
vars.tomean.fourth <- c(vars.tomean.third, 'svymthRound')
var.tomean.fourth.name.suffix <- '_mh0mepm2m'
list.vars.tomean = list(
#                         vars.tomean.first,
                        vars.tomean.second,
                        vars.tomean.third,
                        vars.tomean.fourth
                        )
list.vars.tomean.name.suffix <- list(
#                                     var.tomean.first.name.suffix,
                                     var.tomean.second.name.suffix,
                                     var.tomean.third.name.suffix,
                                     var.tomean.fourth.name.suffix
                                    )

5.3.1.1 Obtain Regression Coefficients (Here, from IV Regressions)

# Regressions
# regf.iv from C:\Users\fan\R4Econ\linreg\ivreg\ivregdfrow.R
df.reg.out <- as_tibble(
  bind_rows(lapply(vars.y, regf.iv,
                   vars.x=vars.x, vars.c=vars.c, vars.z=vars.z, df=df)))
# Regressions
# reg1 <- regf.iv(var.y = var.y1, vars.x, vars.c, vars.z, df.use)
# reg2 <- regf.iv(var.y = var.y2, vars.x, vars.c, vars.z, df.use)
# df.reg.out <- as_tibble(bind_rows(reg1, reg2))
# df.reg.out
# Select Variables
str.esti.suffix <- '_Estimate'
arr.esti.name <- paste0(vars.xc, str.esti.suffix)
str.outcome.name <- 'vars_var.y'
arr.columns2select <- c(arr.esti.name, str.outcome.name)
arr.columns2select
## [1] "prot_Estimate"        "male_Estimate"        "wgt0_Estimate"       
## [4] "hgt0_Estimate"        "svymthRound_Estimate" "vars_var.y"
# Generate dataframe for coefficients
df.coef <- df.reg.out[,c(arr.columns2select)] %>%
  mutate_at(vars(arr.esti.name), as.numeric) %>% column_to_rownames(str.outcome.name)
df.coef %>%
  kable() %>%
  kable_styling_fc()
prot_Estimate male_Estimate wgt0_Estimate hgt0_Estimate svymthRound_Estimate
hgt -0.2714772 1.244735 0.0004430 0.6834853 1.133919
wgt -59.0727542 489.852902 0.7696158 75.4867897 250.778883
str(df.coef)
## 'data.frame':    2 obs. of  5 variables:
##  $ prot_Estimate       : num  -0.271 -59.073
##  $ male_Estimate       : num  1.24 489.85
##  $ wgt0_Estimate       : num  0.000443 0.769616
##  $ hgt0_Estimate       : num  0.683 75.487
##  $ svymthRound_Estimate: num  1.13 250.78
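
Because the outcome names are the row names of df.coef, the decomposition step below can align coefficients with observations by indexing rows with the outcome stored in the variable column. For example, using the objects just constructed:

# Look up one coefficient by outcome (row name) and regressor (column name)
df.coef['hgt', 'prot_Estimate']
# Indexing rows with a vector of outcome names repeats rows as needed; this is
# how df.coef[df$variable, ...] lines coefficients up with each observation
df.coef[c('hgt', 'hgt', 'wgt'), c('prot_Estimate', 'hgt0_Estimate')]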

5.3.1.2 Decomposition Step 1

# Decomposition Step 1: gather
df.decompose_step1 <- df.use %>%
                        filter(svymthRound %in% c(12, 18, 24)) %>%
                        select(one_of(c(vars.other.keep, vars.xc, vars.y))) %>%
                        drop_na() %>%
                        gather(variable, value, -one_of(c(vars.other.keep, vars.xc)))
options(repr.matrix.max.rows=20, repr.matrix.max.cols=20)
dim(df.decompose_step1)
## [1] 1382   10
head(df.decompose_step1, 10) %>%
  kable() %>%
  kable_styling_fc()
S.country vil.id indi.id svymthRound prot male wgt0 hgt0 variable value
Guatemala 3 1352 18 13.3 1 2545.2 47.4 hgt 70.2
Guatemala 3 1352 24 46.3 1 2545.2 47.4 hgt 75.8
Guatemala 3 1354 12 1.0 1 3634.3 51.2 hgt 66.3
Guatemala 3 1354 18 9.8 1 3634.3 51.2 hgt 69.2
Guatemala 3 1354 24 15.4 1 3634.3 51.2 hgt 75.3
Guatemala 3 1356 12 8.6 1 3911.8 51.9 hgt 68.1
Guatemala 3 1356 18 17.8 1 3911.8 51.9 hgt 74.1
Guatemala 3 1356 24 30.5 1 3911.8 51.9 hgt 77.1
Guatemala 3 1357 12 1.0 1 3791.4 52.6 hgt 71.5
Guatemala 3 1357 18 12.7 1 3791.4 52.6 hgt 77.8

5.3.1.3 Decomposition Step 2

# Decomposition Step 2: mutate_at(vars, funs(mean = mean(.)))
# averaging the x and c variables could have been done earlier; their means do
# not differ across the gathered outcome groups, since the rows are duplicated
df.decompose_step2 <- df.decompose_step1 %>%
                        group_by(variable) %>%
                        mutate_at(vars(c(vars.xc, 'value')), funs(mean = mean(.))) %>%
                        ungroup()

options(repr.matrix.max.rows=20, repr.matrix.max.cols=20)
dim(df.decompose_step2)
## [1] 1382   16
head(df.decompose_step2,10) %>%
  kable() %>%
  kable_styling_fc_wide()
S.country vil.id indi.id svymthRound prot male wgt0 hgt0 variable value prot_mean male_mean wgt0_mean hgt0_mean svymthRound_mean value_mean
Guatemala 3 1352 18 13.3 1 2545.2 47.4 hgt 70.2 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1352 24 46.3 1 2545.2 47.4 hgt 75.8 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1354 12 1.0 1 3634.3 51.2 hgt 66.3 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1354 18 9.8 1 3634.3 51.2 hgt 69.2 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1354 24 15.4 1 3634.3 51.2 hgt 75.3 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1356 12 8.6 1 3911.8 51.9 hgt 68.1 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1356 18 17.8 1 3911.8 51.9 hgt 74.1 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1356 24 30.5 1 3911.8 51.9 hgt 77.1 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1357 12 1.0 1 3791.4 52.6 hgt 71.5 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216
Guatemala 3 1357 18 12.7 1 3791.4 52.6 hgt 77.8 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216

5.3.1.4 Decomposition Step 3 Non-Loop

# Adjust each observation's outcome by moving the variables in vars.tomean to
# their means: value + sum over k of beta_k * (mean_k - x_k), where beta_k is
# looked up in df.coef by the observation's outcome (the variable column)
ff_lr_decompose_valadj <- function(df, df.coef, vars.tomean, str.esti.suffix) {
  new_value <- (df$value +
                  rowSums((df[paste0(vars.tomean, '_mean')] - df[vars.tomean])
                          *df.coef[df$variable, paste0(vars.tomean, str.esti.suffix)]))
  return(new_value)
}
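
Before running the full loop in the next step, the function can be tried on a single adjustment set, using the objects built above:

# Apply the adjustment once, for the second to-mean set, and inspect the result
ar_value_adj <- ff_lr_decompose_valadj(
  df.decompose_step2, df.coef, vars.tomean.second, str.esti.suffix)
head(ar_value_adj)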

5.3.1.5 Decomposition Step 3 With Loop

df.decompose_step3 <- df.decompose_step2
for (i in seq_along(list.vars.tomean)) {
    # Name of the new adjusted-value column for this to-mean set
    var.decomp.cur <- (paste0('value', list.vars.tomean.name.suffix[[i]]))
    vars.tomean <- list.vars.tomean[[i]]
    # Add the adjusted outcome column for the current set of to-mean variables
    df.decompose_step3 <- df.decompose_step3 %>%
      mutate((!!var.decomp.cur) :=
               ff_lr_decompose_valadj(., df.coef, vars.tomean, str.esti.suffix))
}

dim(df.decompose_step3)
## [1] 1382   19
head(df.decompose_step3, 10) %>%
  kable() %>%
  kable_styling_fc_wide()
S.country vil.id indi.id svymthRound prot male wgt0 hgt0 variable value prot_mean male_mean wgt0_mean hgt0_mean svymthRound_mean value_mean value_mh0me2m value_mh0mep2m value_mh0mepm2m
Guatemala 3 1352 18 13.3 1 2545.2 47.4 hgt 70.2 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 73.19390 71.19903 71.68148
Guatemala 3 1352 24 46.3 1 2545.2 47.4 hgt 75.8 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 78.79390 85.75778 79.43671
Guatemala 3 1354 12 1.0 1 3634.3 51.2 hgt 66.3 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 63.61689 58.28285 65.56882
Guatemala 3 1354 18 9.8 1 3634.3 51.2 hgt 69.2 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 66.51689 63.57185 64.05430
Guatemala 3 1354 24 15.4 1 3634.3 51.2 hgt 75.3 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 72.61689 71.19213 64.87106
Guatemala 3 1356 12 8.6 1 3911.8 51.9 hgt 68.1 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 64.33707 61.06626 68.35222
Guatemala 3 1356 18 17.8 1 3911.8 51.9 hgt 74.1 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 70.33707 69.56385 70.04630
Guatemala 3 1356 24 30.5 1 3911.8 51.9 hgt 77.1 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 73.33707 76.01161 69.69055
Guatemala 3 1357 12 1.0 1 3791.4 52.6 hgt 71.5 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 66.83353 61.49949 68.78545
Guatemala 3 1357 18 12.7 1 3791.4 52.6 hgt 77.8 20.64819 0.5499276 3312.297 49.75137 18.42547 73.41216 73.13353 70.97578 71.45823

5.3.1.6 Decomposition Step 4 Variance

df.decompose_step3 %>%
        select(variable, contains('value')) %>%
        group_by(variable) %>%
        summarize_all(funs(mean = mean, var = var)) %>%
        select(matches('value')) %>% select(ends_with("_var")) %>%
        mutate_if(is.numeric, funs( frac = (./value_var))) %>%
        mutate_if(is.numeric, round, 3) %>%
  kable() %>%
  kable_styling_fc_wide()
value_var value_mean_var value_mh0me2m_var value_mh0mep2m_var value_mh0mepm2m_var value_var_frac value_mean_var_frac value_mh0me2m_var_frac value_mh0mep2m_var_frac value_mh0mepm2m_var_frac
21.864 NA 25.35 49.047 23.06 1 NA 1.159 2.243 1.055
2965693.245 NA 2949187.64 4192769.518 3147506.60 1 NA 0.994 1.414 1.061
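
The two select calls above drop the variable column, so the first row of the table corresponds to hgt and the second to wgt (the grouped order). A variant of the same computation that keeps the outcome label:

# Same variance summary, but retaining the outcome identifier column
df.decompose_step3 %>%
        select(variable, contains('value')) %>%
        group_by(variable) %>%
        summarize_all(funs(mean = mean, var = var)) %>%
        select(variable, ends_with("_var")) %>%
        mutate_if(is.numeric, funs(frac = (./value_var))) %>%
        mutate_if(is.numeric, round, 3) %>%
  kable() %>%
  kable_styling_fc_wide()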

5.3.1.7 Graphical Results

Graphically, it is difficult to pick up the exact differences in variance: a 50 percent reduction in variance does not visually look like 50 percent. Intuitively, what the graph shows is closer to the standard deviation than to the variance, if we think about the x-axis scale.

head(df.decompose_step3 %>%
    select(variable, contains('value'), -value_mean), 10) %>%
  kable() %>%
  kable_styling_fc()
variable value value_mh0me2m value_mh0mep2m value_mh0mepm2m
hgt 70.2 73.19390 71.19903 71.68148
hgt 75.8 78.79390 85.75778 79.43671
hgt 66.3 63.61689 58.28285 65.56882
hgt 69.2 66.51689 63.57185 64.05430
hgt 75.3 72.61689 71.19213 64.87106
hgt 68.1 64.33707 61.06626 68.35222
hgt 74.1 70.33707 69.56385 70.04630
hgt 77.1 73.33707 76.01161 69.69055
hgt 71.5 66.83353 61.49949 68.78545
hgt 77.8 73.13353 70.97578 71.45823
df.decompose_step3 %>%
    select(variable, contains('value'), -value_mean) %>%
    rename(outcome = variable) %>%
    gather(variable, value, -outcome) %>%
    ggplot(aes(x=value, color = variable, fill = variable)) +
        geom_line(stat = "density") +
        facet_wrap(~ outcome, scales='free', nrow=2)
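
To put numbers on that intuition, the same comparison can be made in standard-deviation units; the ratios across these columns are the square roots of the variance ratios reported above:

# Standard deviations of the raw and adjusted outcome values, by outcome
df.decompose_step3 %>%
    select(variable, contains('value'), -value_mean) %>%
    group_by(variable) %>%
    summarize_all(funs(sd = sd)) %>%
  kable() %>%
  kable_styling_fc()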

5.3.1.8 Additional Decomposition Tests

head(df.decompose_step2[vars.tomean.first],3)
head(df.decompose_step2[paste0(vars.tomean.first, '_mean')], 3)
head(df.coef[df.decompose_step2$variable,
             paste0(vars.tomean.first, str.esti.suffix)], 3)
df.decompose.tomean.first <- df.decompose_step2 %>%
    mutate(pred_new = df.decompose_step2$value +
        rowSums((df.decompose_step2[paste0(vars.tomean.first, '_mean')]
                 - df.decompose_step2[vars.tomean.first])
            *df.coef[df.decompose_step2$variable,
                     paste0(vars.tomean.first, str.esti.suffix)])) %>%
        select(variable, value, pred_new)
head(df.decompose.tomean.first, 10)
df.decompose.tomean.first %>%
        group_by(variable) %>%
        summarize_all(funs(mean = mean, sd = sd))  %>%
  kable() %>%
  kable_styling_fc()
variable value_mean pred_new_mean value_sd pred_new_sd
hgt 73.41216 73.41216 4.675867 4.534947
wgt 8807.87656 8807.87656 1722.118824 1695.221845

Note that the R-squared from the regression above matches up with 1 minus the ratio below. This is the proper decomposition method, the one that is equivalent to R-squared.

df.decompose_step2 %>%
    mutate(pred_new = df.decompose_step2$value +
        rowSums((df.decompose_step2[paste0(vars.tomean.second, '_mean')]
                 - df.decompose_step2[vars.tomean.second])
            *df.coef[df.decompose_step2$variable,
                     paste0(vars.tomean.second, str.esti.suffix)])) %>%
        select(variable, value, pred_new) %>%
        group_by(variable) %>%
        summarize_all(funs(mean = mean, var = var)) %>%
        mutate(ratio = (pred_new_var/value_var))  %>%
  kable() %>%
  kable_styling_fc()
variable value_mean pred_new_mean value_var pred_new_var ratio
hgt 73.41216 73.41216 2.186374e+01 25.3504 1.1594724
wgt 8807.87656 8807.87656 2.965693e+06 2949187.6357 0.9944345
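
As a rough check of the claim above (a sketch only, assuming the specification and sample match what regf.iv estimated: prot instrumented by vil.id with the same controls), the height equation could be re-estimated directly with AER::ivreg and its R-squared compared with one minus the ratio in the table:

# Hypothetical check: re-estimate the height equation directly and compare
# summary(.)$r.squared with 1 - (pred_new_var / value_var) from the table above
rs_iv_hgt <- AER::ivreg(
  hgt ~ prot + male + wgt0 + hgt0 + svymthRound |
    vil.id + male + wgt0 + hgt0 + svymthRound,
  data = df)
summary(rs_iv_hgt)$r.squared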