Go to the RMD, R, PDF, or HTML version of this file. Go back to fan’s REconTools Package, R Code Examples Repository (bookdown site), or Intro Stats with R Repository (bookdown site).
There is a variable, sort by this variable, then generate index from 1 to N representing sorted values of this index. If there are repeating values, still assign index, different index each value.
# Sort and generate variable equal to sorted index
df_iris <- iris %>% arrange(Sepal.Length) %>%
mutate(Sepal.Len.Index = row_number()) %>%
select(Sepal.Length, Sepal.Len.Index, everything())
# Show results Head 10
df_iris %>% head(10) %>%
kable() %>%
kable_styling_fc_wide()
Sepal.Length | Sepal.Len.Index | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
4.3 | 1 | 3.0 | 1.1 | 0.1 | setosa |
4.4 | 2 | 2.9 | 1.4 | 0.2 | setosa |
4.4 | 3 | 3.0 | 1.3 | 0.2 | setosa |
4.4 | 4 | 3.2 | 1.3 | 0.2 | setosa |
4.5 | 5 | 2.3 | 1.3 | 0.3 | setosa |
4.6 | 6 | 3.1 | 1.5 | 0.2 | setosa |
4.6 | 7 | 3.4 | 1.4 | 0.3 | setosa |
4.6 | 8 | 3.6 | 1.0 | 0.2 | setosa |
4.6 | 9 | 3.2 | 1.4 | 0.2 | setosa |
4.7 | 10 | 3.2 | 1.3 | 0.2 | setosa |
We would like to calculate for example the ratio of each individual’s highest to the the person with the lowest height in a dataset. We first need to generated sorted index from lowest to highest, and then populate the lowest height to all rows, and then divide.
Search Terms:
Links:
We just want the lowest value to be in its own column, so that we can compute various statistics using the lowest value variable and the original variable.
# 1. Sort
df_iris_m1 <- iris %>% mutate(Sepal.Len.Lowest.all = min(Sepal.Length)) %>%
select(Sepal.Length, Sepal.Len.Lowest.all, everything())
# Show results Head 10
df_iris_m1 %>% head(10) %>%
kable() %>%
kable_styling_fc_wide()
Sepal.Length | Sepal.Len.Lowest.all | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
5.1 | 4.3 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 4.3 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 4.3 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 4.3 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 4.3 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 4.3 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 4.3 | 3.4 | 1.4 | 0.3 | setosa |
5.0 | 4.3 | 3.4 | 1.5 | 0.2 | setosa |
4.4 | 4.3 | 2.9 | 1.4 | 0.2 | setosa |
4.9 | 4.3 | 3.1 | 1.5 | 0.1 | setosa |
This is the long method, using row_number, and case_when. The benefit of this method is that it generates several intermediate variables that might be useful. And the key final step is to set a new variable (A=Sepal.Len.Lowest.all) equal to another variable’s (B=Sepal.Length’s) value at the index that satisfies condition based a third variable (C=Sepal.Len.Index).
# 1. Sort
# 2. generate index
# 3. value at lowest index (case_when)
# 4. spread value from lowest index to other rows
# Note step 4 does not require step 3
df_iris_m2 <- iris %>% arrange(Sepal.Length) %>%
mutate(Sepal.Len.Index = row_number()) %>%
mutate(Sepal.Len.Lowest.one =
case_when(row_number()==1 ~ Sepal.Length)) %>%
mutate(Sepal.Len.Lowest.all =
Sepal.Length[Sepal.Len.Index==1]) %>%
select(Sepal.Length, Sepal.Len.Index,
Sepal.Len.Lowest.one, Sepal.Len.Lowest.all)
# Show results Head 10
df_iris_m2 %>% head(10) %>%
kable() %>%
kable_styling_fc_wide()
Sepal.Length | Sepal.Len.Index | Sepal.Len.Lowest.one | Sepal.Len.Lowest.all |
---|---|---|---|
4.3 | 1 | 4.3 | 4.3 |
4.4 | 2 | NA | 4.3 |
4.4 | 3 | NA | 4.3 |
4.4 | 4 | NA | 4.3 |
4.5 | 5 | NA | 4.3 |
4.6 | 6 | NA | 4.3 |
4.6 | 7 | NA | 4.3 |
4.6 | 8 | NA | 4.3 |
4.6 | 9 | NA | 4.3 |
4.7 | 10 | NA | 4.3 |
Generate Positive and Negative Index based on Ordered Deviation from some Number.
There is a variable that is continuous, substract a number from this variable, and generate index based on deviations. Think of the index as generating intervals indicating where the value lies. 0th index indicates the largest value in sequence that is smaller than or equal to number \(x\), 1st index indicates the smallest value in sequence that is larger than number \(x\).
The solution below is a little bit convoluated and long, there is likely a much quicker way. The process below shows various intermediary outputs that help arrive at deviation index Sepal.Len.Devi.Index from initial sorted index Sepal.Len.Index.
search:
# 1. Sort and generate variable equal to sorted index
# 2. Plus or minus deviations from some value
# 3. Find the zero, which means, the number closests to zero including zero from the negative side
# 4. Find the index at the highest zero and below deviation point
# 5. Difference of zero index and original sorted index
sc_val_x <- 4.65
df_iris_deviate <- iris %>% arrange(Sepal.Length) %>%
mutate(Sepal.Len.Index = row_number()) %>%
mutate(Sepal.Len.Devi = (Sepal.Length - sc_val_x)) %>%
mutate(Sepal.Len.Devi.Neg =
case_when(Sepal.Len.Devi <= 0 ~ (-1)*(Sepal.Len.Devi))) %>%
arrange((Sepal.Len.Devi.Neg), desc(Sepal.Len.Index)) %>%
mutate(Sepal.Len.Index.Zero =
case_when(row_number() == 1 ~ Sepal.Len.Index)) %>%
mutate(Sepal.Len.Devi.Index =
Sepal.Len.Index - Sepal.Len.Index.Zero[row_number() == 1]) %>%
arrange(Sepal.Len.Index) %>%
select(Sepal.Length, Sepal.Len.Index, Sepal.Len.Devi,
Sepal.Len.Devi.Neg, Sepal.Len.Index.Zero, Sepal.Len.Devi.Index)
# Show results Head 10
df_iris_deviate %>% head(20) %>%
kable() %>%
kable_styling_fc_wide()
Sepal.Length | Sepal.Len.Index | Sepal.Len.Devi | Sepal.Len.Devi.Neg | Sepal.Len.Index.Zero | Sepal.Len.Devi.Index |
---|---|---|---|---|---|
4.3 | 1 | -0.35 | 0.35 | NA | -8 |
4.4 | 2 | -0.25 | 0.25 | NA | -7 |
4.4 | 3 | -0.25 | 0.25 | NA | -6 |
4.4 | 4 | -0.25 | 0.25 | NA | -5 |
4.5 | 5 | -0.15 | 0.15 | NA | -4 |
4.6 | 6 | -0.05 | 0.05 | NA | -3 |
4.6 | 7 | -0.05 | 0.05 | NA | -2 |
4.6 | 8 | -0.05 | 0.05 | NA | -1 |
4.6 | 9 | -0.05 | 0.05 | 9 | 0 |
4.7 | 10 | 0.05 | NA | NA | 1 |
4.7 | 11 | 0.05 | NA | NA | 2 |
4.8 | 12 | 0.15 | NA | NA | 3 |
4.8 | 13 | 0.15 | NA | NA | 4 |
4.8 | 14 | 0.15 | NA | NA | 5 |
4.8 | 15 | 0.15 | NA | NA | 6 |
4.8 | 16 | 0.15 | NA | NA | 7 |
4.9 | 17 | 0.25 | NA | NA | 8 |
4.9 | 18 | 0.25 | NA | NA | 9 |
4.9 | 19 | 0.25 | NA | NA | 10 |
4.9 | 20 | 0.25 | NA | NA | 11 |