Go to the RMD, R, PDF, or HTML version of this file. Go back to fan’s REconTools research support package, R4Econ examples page, PkgTestR packaging guide, or Stat4Econ course page.
First, we tabulate a dataset, and show categories as rows, and display frequencies.
# We use the mtcars dataset
tb_tab_joint <- mtcars %>%
group_by(gear, am) %>%
tally()
# Display
tb_tab_joint %>%
kable(caption = "cross tabulation, stacked") %>%
kable_styling_fc()
gear | am | n |
---|---|---|
3 | 0 | 15 |
4 | 0 | 4 |
4 | 1 | 8 |
5 | 1 | 5 |
We can present this as cross tabs.
# We use the mtcars dataset
tb_cross_tab <- mtcars %>%
group_by(gear, am) %>%
tally() %>%
spread(am, n)
# Display
tb_cross_tab %>%
kable(caption = "cross tabulation") %>%
kable_styling_fc()
gear | 0 | 1 |
---|---|---|
3 | 15 | NA |
4 | 4 | 8 |
5 | NA | 5 |
We have two variables variables, am and mpg, the mpg values are not unique. We want to know how many unique mpg levels are there for each am group. We use the dplyr::distinct function to achieve this.
tb_dist_tab <- mtcars %>%
# .keep_all to keep all variables
distinct(am, mpg, .keep_all = TRUE) %>%
group_by(am) %>%
tally()
# Display
tb_dist_tab %>%
kable(caption = "Tabulate distinct groups") %>%
kable_styling_fc()
am | n |
---|---|
0 | 16 |
1 | 11 |
There are \(N\) individuals, each observed for \(Y_i\) years. We start with a dataframe where individuals are the unit of observation, we expand this to a panel with a row for each of the years that the individual is in the survey for.
Algorithm:
First, we construct the dataframe where each row is an individual.
# 1. Array of Years in the Survey
ar_years_in_survey <- c(2, 3, 1, 10, 2, 5)
ar_start_yaer <- c(1, 2, 3, 1, 1, 1)
ar_end_year <- c(2, 4, 3, 10, 2, 5)
mt_combine <- cbind(ar_years_in_survey, ar_start_yaer, ar_end_year)
# This is the individual attribute dataset, attributes that are invariant acrosss years
tb_indi_attributes <- as_tibble(mt_combine) %>% rowid_to_column(var = "ID")
# Display
tb_indi_attributes %>%
head(10) %>%
kable() %>%
kable_styling_fc()
ID | ar_years_in_survey | ar_start_yaer | ar_end_year |
---|---|---|---|
1 | 2 | 1 | 2 |
2 | 3 | 2 | 4 |
3 | 1 | 3 | 3 |
4 | 10 | 1 | 10 |
5 | 2 | 1 | 2 |
6 | 5 | 1 | 5 |
Second, we change the dataframe so that each unit of observation is an individual in an year. This means we will duplicate the information in the prior table, so if an individual appears for 4 years in the survey, we will now have four rows for this individual. We generate a new variable that is the calendar year. This is now a panel dataset.
# 2. Sort and generate variable equal to sorted index
tb_indi_panel <- tb_indi_attributes %>% uncount(ar_years_in_survey)
# 3. Panel now construct exactly which year in survey, note that all needed is sort index
# Note sorting not needed, all rows identical now
tb_indi_panel <- tb_indi_panel %>%
group_by(ID) %>%
mutate(yr_in_survey = row_number())
tb_indi_panel <- tb_indi_panel %>%
mutate(calendar_year = yr_in_survey + ar_start_yaer - 1)
# Show results Head 10
tb_indi_panel %>%
head(10) %>%
kable() %>%
kable_styling_fc()
ID | ar_start_yaer | ar_end_year | yr_in_survey | calendar_year |
---|---|---|---|---|
1 | 1 | 2 | 1 | 1 |
1 | 1 | 2 | 2 | 2 |
2 | 2 | 4 | 1 | 2 |
2 | 2 | 4 | 2 | 3 |
2 | 2 | 4 | 3 | 4 |
3 | 3 | 3 | 1 | 3 |
4 | 1 | 10 | 1 | 1 |
4 | 1 | 10 | 2 | 2 |
4 | 1 | 10 | 3 | 3 |
4 | 1 | 10 | 4 | 4 |