# Week 4: Overfitting

The fourth week covers Chapter 7 (Ulysses’ Compass).

## 4.1 Lectures

Lecture 7:

Lecture 8:

## 4.2 Exercises

### 4.2.1 Chapter 7

**7E1.** State the three motivating criteria that define information entropy. Try to express each in your own words.

Information is defined as *the reduction in uncertainty when we learn an outcome*. The motivating criteria for defining information entropy revolve around the measure of uncertainty that is used to derive information.

The first is that the measure of uncertainty must be continuous. This prevents large changes in the uncertainty measure resulting from relatively small changes in probabilities. Such a phenomenon often occurs when researchers use a *p*-value cutoff of .05 to claim “significance.” Often, the difference between “significant” and “non-significant” results is itself non-significant 🤯.

The second is that the measure of uncertainty should increase as the number of possible events increases. When there are more potential outcomes, there are more predictions that have to be made, and therefore more uncertainty about which outcome will be observed. For example, if your friend asks you to guess a number between 1 and 100, you are much less likely to guess correctly than if you were guessing a number between 1 and 2.

The third and final criterion is that the measure of uncertainty should be additive. This means that if we calculate the uncertainty for two sets of outcomes (e.g., heads or tails on a coin flip and the result of a thrown die), then the uncertainty of combinations of events (e.g., heads and “3”) should be equal to the sum of the uncertainties of the two separate events.

Information entropy is the only function that satisfies all three criteria.

**7E2.** Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

Entropy is the negative of the average log-probability of an event. The formula is given as

\[\begin{equation} H(p) = -\text{E}\log(p_i) = -\sum_{i=1}^np_i\log(p_i) \end{equation}\]

Thus, for each probability \(p_i\), we multiply \(p_i\) by \(\log(p_i)\), sum all the values, and then multiply the sum by negative one. To implement this, we’ll first write a couple of functions to do the calculations. We could do this without functions, but functions will allow us to handle cases where \(p_i = 0\), as will be the case in a couple of problems. The first function, `p_logp()`, returns `0` if `p` is 0, and returns `p * log(p)` otherwise. The `calc_entropy()` function is a wrapper around `p_logp()`, applying `p_logp()` to each element of a vector of probabilities, summing the results, and multiplying the sum by -1.

```
# map_dbl() comes from the {purrr} package
p_logp <- function(p) {
  if (p == 0) return(0)
  p * log(p)
}

calc_entropy <- function(x) {
  avg_logprob <- sum(map_dbl(x, p_logp))
  -1 * avg_logprob
}
```

Applying these functions to the probabilities in this problem results in an entropy of about 0.61. Note this is the same as the weather example in the text, because in both cases there were two events with probabilities of 0.3 and 0.7.
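For completeness, a minimal sketch of that calculation using the functions defined above:

```
probs <- c(0.7, 0.3)
calc_entropy(probs)
#> [1] 0.611
```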

**7E3.** Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

Now we have four outcomes. We can reuse our code from above, substituting the new probabilities into the vector `probs`. This results in an entropy of about 1.38. As expected, because there are now more outcomes, the entropy is higher than what was observed in the previous problem.
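A sketch of the calculation, substituting the new probabilities:

```
probs <- c(0.20, 0.25, 0.25, 0.30)
calc_entropy(probs)
#> [1] 1.38
```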

**7E4.** Suppose another four-sided die is loaded such that it never shows “4.” The other three sides show equally often. What is the entropy of this die?

Again, we can copy our code from above, replacing the probabilities. Even though there are four outcomes specified, there are effectively only three, as the outcome “4” has probability 0. Thus, we would expect entropy to decrease, as there are fewer possible outcomes than in the previous problem. This is indeed what we find, as this die’s entropy is about 1.1.

```
probs <- c(1, 1, 1, 0)
probs <- probs / sum(probs)
probs
#> [1] 0.333 0.333 0.333 0.000
calc_entropy(probs)
#> [1] 1.1
```

**7M1.** Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

The AIC is defined as follows, where \(\text{lppd}\) is the log-pointwise-predictive density, and \(p\) is the number of free parameters in the posterior distribution.

\[ \text{AIC} = -2\text{lppd} + 2p \]

In contrast, the WAIC is defined as:

\[ \text{WAIC}(y,\Theta) = -2\Big(\text{lppd} - \sum_i \text{var}_{\theta}\log p(y_i | \theta)\Big) \]

If we distribute the \(-2\) through, this looks remarkably similar to the AIC formula, with the exception of the final \(p\) term. Whereas the AIC uses 2 times the number of free parameters, the WAIC uses 2 times the sum of the log-probability variances from each observation.

The WAIC is more general than the AIC, as the AIC assumes that priors are flat or overwhelmed by the likelihood, the posterior distribution is approximately multivariate Gaussian, and the sample size is much greater than the number of parameters. If all of these assumptions are met, then we would expect the AIC and WAIC to be about the same.

**7M2.** Explain the difference between model *selection* and model *comparison*. What information is lost under model selection?

Model selection refers to just picking the model that has the lowest (i.e., best) criterion value and discarding other models. When we take this approach, we lose information about the relative model accuracy that can be seen across the criterion values for the candidate models. This information can inform how confident we are in the models. Additionally, the model selection paradigm cares only about predictive accuracy and ignores causal inference. Thus, a model may be selected that has confounds or that would not correctly inform an intervention.

In contrast, model comparison uses multiple models to understand how the variables included influence prediction and affect implied conditional independencies in a causal model. Thus, we preserve information and can make more holistic judgments about our data and models.

**7M3.** When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

All of the information criteria are defined based on the log-pointwise-predictive density, defined as follows, where \(y\) is the data, \(\Theta\) is the posterior distribution, \(S\) is the number of samples, and \(I\) is the number of observations.

\[ \text{lppd}(y,\Theta) = \sum_i\log\frac{1}{S}\sum_sp(y_i|\Theta_s) \]

In words, this means take the log of the average probability across samples of each observation \(i\) and sum them together. Thus, a larger sample size will necessarily lead to a smaller log-pointwise-predictive-density, even if the data generating process and models are exactly equivalent (i.e., when the LPPD values are negative, the sum will get more negative as the sample size increases). More observations are entered into the sum, leading to a smaller final lppd, which will in turn increase the information criteria. We can run a quick simulation to demonstrate. For three different sample sizes, we’ll simulate 100 data sets, estimate a linear model, and then calculate the LPPD, WAIC, and PSIS for each.
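To make the definition concrete, here is a small helper (the name `calc_lppd` and the matrix layout are illustrative, not from the text; rows are posterior samples and columns are observations):

```
# lppd from an S x N matrix of pointwise likelihoods p(y_i | Theta_s):
# average the likelihood over samples for each observation, take the
# log, and sum across observations
calc_lppd <- function(prob_matrix) {
  sum(log(colMeans(prob_matrix)))
}
```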

```
set.seed(2020)

sample_sim <- tibble(sample_size = rep(c(100, 500, 1000), each = 100)) %>%
  mutate(
    sample_data = map(sample_size,
                      function(n) {
                        tibble(x1 = rnorm(n = n)) %>%
                          mutate(y = rnorm(n = n, mean = 0.3 + 0.8 * x1),
                                 across(everything(), standardize))
                      }),
    model = map(sample_data,
                function(df) {
                  mod <- quap(alist(y ~ dnorm(mu, sigma),
                                    mu <- alpha + beta * x1,
                                    alpha ~ dnorm(0, 0.2),
                                    beta ~ dnorm(0, 0.5),
                                    sigma ~ dexp(1)),
                              data = df, start = list(alpha = 0, beta = 0))
                  return(mod)
                }),
    lppd = map_dbl(model, ~sum(rethinking::lppd(.x))),
    infc = map(model,
               function(mod) {
                 w <- rethinking::WAIC(mod)
                 p <- suppressMessages(rethinking::PSIS(mod))
                 tibble(waic = w$WAIC, psis = p$PSIS)
               })
  ) %>%
  unnest(infc)
```

Now, we can visualize the distribution of the LPPD, WAIC, and PSIS. As predicted, the LPPD gets more negative as the sample size increases, even though the data generation process and estimated model are identical. Accordingly, the WAIC and PSIS increase. Note that the WAIC and PSIS values are approximately \(-2 \times \text{lppd}\). Thus, if we fit one model with 100 observations and a second model with 1,000 observations, we might conclude from the WAIC and PSIS that the first model, with 100 observations, has much better predictive accuracy, because its WAIC and PSIS values are lower. However, this would be only an artifact of the different sample sizes, and may not actually represent true differences between the models.

```
sample_sim %>%
  pivot_longer(cols = c(lppd, waic, psis)) %>%
  mutate(sample_size = glue("N = {sample_size}"),
         sample_size = fct_inorder(sample_size),
         name = str_to_upper(name),
         name = fct_inorder(name)) %>%
  ggplot(aes(x = value)) +
  facet_grid(rows = vars(sample_size), cols = vars(name), scales = "free_x") +
  geom_histogram(aes(y = stat(density)), binwidth = 50) +
  labs(x = "Value", y = "Density") +
  theme(panel.border = element_rect(fill = NA))
```

**7M4.** What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

The penalty term of the WAIC, \(p_{\text{WAIC}}\), is defined as shown in the WAIC formula. Specifically, the penalty term is the sum of the variances of the log probabilities for each observation.

\[ \text{WAIC}(y,\Theta) = -2\Big(\text{lppd} - \underbrace{\sum_i \text{var}_{\theta}\log p(y_i | \theta)}_{\text{penalty term}}\Big) \]

Smaller variances in log probabilities will result in a lower penalty. If we make the prior more concentrated, we restrict the plausible range of the parameters. In other words, we restrict the variability in the posterior distribution. As the parameters become more consistent, the log probability of each observation will necessarily become more consistent also. Thus, the penalty term, or effective number of parameters, becomes smaller. We can again confirm this with a small simulation.

```
set.seed(2020)

prior_sim <- tibble(prior_sd = rep(c(0.1, 1, 10), each = 100)) %>%
  mutate(
    sample_data = map(1:n(),
                      function(x) {
                        n <- 20
                        tibble(x1 = rnorm(n = n),
                               x2 = rnorm(n = n),
                               x3 = rnorm(n = n)) %>%
                          mutate(y = rnorm(n = n, mean = 0.3 + 0.8 * x1 +
                                             0.6 * x2 + 1.2 * x3),
                                 across(everything(), standardize))
                      }),
    model = map2(sample_data, prior_sd,
                 function(df, p_sd) {
                   mod <- brm(y ~ 1 + x1 + x2 + x3, data = df,
                              family = gaussian,
                              prior = c(prior(normal(0, 0.2), class = Intercept),
                                        prior_string(glue("normal(0, {p_sd})"), class = "b"),
                                        prior(exponential(1), class = sigma)),
                              iter = 4000, warmup = 3000, chains = 4, cores = 4,
                              seed = 1234)
                   return(mod)
                 }),
    infc = map(model,
               function(mod) {
                 w <- suppressWarnings(brms::waic(mod))
                 p <- suppressWarnings(brms::loo(mod))
                 tibble(p_waic = w$estimates["p_waic", "Estimate"],
                        p_loo = p$estimates["p_loo", "Estimate"])
               })
  ) %>%
  unnest(infc)
```

Visualizing the results, we can see that the more concentrated prior does indeed result in a smaller penalty, or effective number of parameters.

```
prior_sim %>%
  pivot_longer(cols = c(p_waic, p_loo)) %>%
  mutate(prior_sd = glue("sd~'='~{prior_sd}"),
         prior_sd = fct_inorder(prior_sd),
         name = factor(name, levels = c("p_waic", "p_loo"),
                       labels = c("p[WAIC]", "p[PSIS]"))) %>%
  ggplot(aes(x = value)) +
  facet_grid(rows = vars(prior_sd), cols = vars(name),
             labeller = label_parsed) +
  geom_histogram(aes(y = stat(density)), binwidth = 0.2) +
  labs(x = "Value", y = "Density") +
  theme(panel.border = element_rect(fill = NA))
```

**7M5.** Provide an informal explanation of why informative priors reduce overfitting.

Informative priors restrict the plausible values for parameters. By using informative priors, we can limit the values of parameters to values that are reasonable, given our scientific knowledge. Thus, we can keep the model from learning too much from our specific sample.

**7M6.** Provide an informal explanation of why overly informative priors result in underfitting.

In contrast to the previous question, making the prior too informative can be too restrictive on the parameter space. This prevents our model from learning enough from our sample. We basically just get our prior distributions back, without learning anything from the data that could help make future predictions.

**7H1.** In 2007, *The Wall Street Journal* published an editorial (“We’re Number One, Alas”) with a graph of corporate tax rates in 29 countries plotted against tax revenue. A badly fit curve was drawn in (reconstructed at right), seemingly by hand, to make the argument that the relationship between tax rate and tax revenue increases and then declines, such that higher tax rates can actually produce less tax revenue. I want you to actually fit a curve to these data, found in `data(Laffer)`. Consider models that use tax rate to predict tax revenue. Compare, using WAIC or PSIS, a straight-line model to any curved models you like. What do you conclude about the relationship between tax rate and tax revenue?

First, let’s standardize the data and fit a straight line, a quadratic line, and a spline model.

```
data(Laffer)

laf_dat <- Laffer %>%
  mutate(tax_rate2 = tax_rate ^ 2,
         across(everything(), standardize))

laf_line <- brm(tax_revenue ~ 1 + tax_rate, data = laf_dat, family = gaussian,
                prior = c(prior(normal(0, 0.2), class = Intercept),
                          prior(normal(0, 0.5), class = b),
                          prior(exponential(1), class = sigma)),
                iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
                file = here("fits", "chp7", "b7h1-line.rds"))

laf_quad <- brm(tax_revenue ~ 1 + tax_rate + tax_rate2, data = laf_dat,
                family = gaussian,
                prior = c(prior(normal(0, 0.2), class = Intercept),
                          prior(normal(0, 0.5), class = b),
                          prior(exponential(1), class = sigma)),
                iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
                file = here("fits", "chp7", "b7h1-quad.rds"))

laf_spln <- brm(tax_revenue ~ 1 + s(tax_rate, bs = "bs"), data = laf_dat,
                family = gaussian,
                prior = c(prior(normal(0, 0.2), class = Intercept),
                          prior(normal(0, 0.5), class = b),
                          prior(normal(0, 0.5), class = sds),
                          prior(exponential(1), class = sigma)),
                iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
                control = list(adapt_delta = 0.99),
                file = here("fits", "chp7", "bh71-spln.rds"))
```

Let’s visualize the models. They all look pretty similar, but the spline does show a slight curve. Next, we can look at the PSIS (called LOO in {brms} and {rstan}) and WAIC comparisons. Both the PSIS and the WAIC prefer the spline model. However, the standard error of the difference in PSIS and WAIC is larger than the actual difference for all models. Thus, neither the PSIS nor the WAIC is really able to differentiate the models in a meaningful way. However, it should be noted that both the PSIS and WAIC have Pareto or penalty values that are exceptionally large, which could make the criteria unreliable.

```
library(loo)

laf_line <- add_criterion(laf_line, criterion = c("loo", "waic"),
                          overwrite = TRUE, force_save = TRUE)
#> Warning: Found 1 observations with a pareto_k > 0.7 in model 'laf_line'. It is
#> recommended to set 'moment_match = TRUE' in order to perform moment matching for
#> problematic observations.
#> Warning:
#> 1 (3.4%) p_waic estimates greater than 0.4. We recommend trying loo instead.
#> Automatically saving the model object in '/Users/runner/work/sr2-solutions/sr2-solutions/fits/chp7/b7h1-line.rds'

laf_quad <- add_criterion(laf_quad, criterion = c("loo", "waic"),
                          overwrite = TRUE, force_save = TRUE)
#> Warning: Found 1 observations with a pareto_k > 0.7 in model 'laf_quad'. It is
#> recommended to set 'moment_match = TRUE' in order to perform moment matching for
#> problematic observations.
#> Warning:
#> 1 (3.4%) p_waic estimates greater than 0.4. We recommend trying loo instead.
#> Automatically saving the model object in '/Users/runner/work/sr2-solutions/sr2-solutions/fits/chp7/b7h1-quad.rds'

laf_spln <- add_criterion(laf_spln, criterion = c("loo", "waic"),
                          overwrite = TRUE, force_save = TRUE)
#> Warning: Found 2 observations with a pareto_k > 0.7 in model 'laf_spln'. It is
#> recommended to set 'moment_match = TRUE' in order to perform moment matching for
#> problematic observations.
#> Warning:
#> 2 (6.9%) p_waic estimates greater than 0.4. We recommend trying loo instead.
#> Automatically saving the model object in '/Users/runner/work/sr2-solutions/sr2-solutions/fits/chp7/bh71-spln.rds'

loo_compare(laf_line, laf_quad, laf_spln, criterion = "waic")
#>          elpd_diff se_diff
#> laf_spln  0.0       0.0
#> laf_line -0.6       0.8
#> laf_quad -0.7       0.7

loo_compare(laf_line, laf_quad, laf_spln, criterion = "loo")
#>          elpd_diff se_diff
#> laf_spln  0.0       0.0
#> laf_line -0.4       0.7
#> laf_quad -0.5       0.7
```

**7H2.** In the `Laffer` data, there is one country with a high tax revenue that is an outlier. Use PSIS and WAIC to measure the importance of this outlier in the models you fit in the previous problem. Then use robust regression with a Student’s t distribution to revisit the curve fitting problem. How much does a curved relationship depend upon the outlier point?

Because I used `brms::brm()` to estimate the models, we can’t use the convenience functions for getting the pointwise values for the PSIS and WAIC that are available in the {rethinking} package. So I’ll write my own, called `criteria_influence()`. When we plot the Pareto *k* and \(p_{\text{WAIC}}\) values, we see that observation 12 is problematic in all three models, and observation 1 is also problematic in the spline model.

```
library(gghighlight)

criteria_influence <- function(mod) {
  tibble(pareto_k = mod$criteria$loo$diagnostics$pareto_k,
         p_waic = mod$criteria$waic$pointwise[, "p_waic"]) %>%
    rowid_to_column(var = "obs")
}

influ <- bind_rows(
  criteria_influence(laf_line) %>%
    mutate(type = "Linear"),
  criteria_influence(laf_quad) %>%
    mutate(type = "Quadratic"),
  criteria_influence(laf_spln) %>%
    mutate(type = "Spline")
)

ggplot(influ, aes(x = pareto_k, y = p_waic)) +
  facet_wrap(~type, ncol = 2) +
  geom_vline(xintercept = 0.7, linetype = "dashed") +
  geom_hline(yintercept = 0.4, linetype = "dashed") +
  geom_point() +
  gghighlight(pareto_k > 0.7 | p_waic > 0.4, n = 1, label_key = obs,
              label_params = list(size = 3)) +
  labs(x = expression(Pareto~italic(k)), y = expression(p[WAIC])) +
  theme(panel.border = element_rect(fill = NA))
```

Let’s refit the model using a Student’s t distribution to put larger tails on our outcome distribution, and then visualize our new models.

```
laf_line2 <- brm(bf(tax_revenue ~ 1 + tax_rate, nu = 1),
                 data = laf_dat, family = student,
                 prior = c(prior(normal(0, 0.2), class = Intercept),
                           prior(normal(0, 0.5), class = b),
                           prior(exponential(1), class = sigma)),
                 iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
                 file = here("fits", "chp7", "b7h2-line.rds"))

laf_quad2 <- brm(bf(tax_revenue ~ 1 + tax_rate + tax_rate2, nu = 1),
                 data = laf_dat, family = student,
                 prior = c(prior(normal(0, 0.2), class = Intercept),
                           prior(normal(0, 0.5), class = b),
                           prior(exponential(1), class = sigma)),
                 iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
                 file = here("fits", "chp7", "b7h2-quad.rds"))

laf_spln2 <- brm(bf(tax_revenue ~ 1 + s(tax_rate, bs = "bs"), nu = 1),
                 data = laf_dat, family = student,
                 prior = c(prior(normal(0, 0.2), class = Intercept),
                           prior(normal(0, 0.5), class = b),
                           prior(normal(0, 0.5), class = sds),
                           prior(exponential(1), class = sigma)),
                 iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
                 control = list(adapt_delta = 0.99),
                 file = here("fits", "chp7", "bh72-spln.rds"))
```

The prediction intervals are a little bit narrower, which makes sense, as the predictions are no longer as influenced by the outlier. When we look at the new PSIS and WAIC estimates, we are no longer getting warning messages about large Pareto *k* values; however, we do still see warnings about large \(p_{\text{WAIC}}\) values. The comparisons also tell the same story as before, with the spline as the preferred model, but no distinguishable differences between the models.

```
laf_line2 <- add_criterion(laf_line2, criterion = c("loo", "waic"),
                           overwrite = TRUE)
#> Warning:
#> 1 (3.4%) p_waic estimates greater than 0.4. We recommend trying loo instead.
#> Automatically saving the model object in '/Users/runner/work/sr2-solutions/sr2-solutions/fits/chp7/b7h2-line.rds'

laf_quad2 <- add_criterion(laf_quad2, criterion = c("loo", "waic"),
                           overwrite = TRUE)
#> Warning:
#> 1 (3.4%) p_waic estimates greater than 0.4. We recommend trying loo instead.
#> Automatically saving the model object in '/Users/runner/work/sr2-solutions/sr2-solutions/fits/chp7/b7h2-quad.rds'

laf_spln2 <- add_criterion(laf_spln2, criterion = c("loo", "waic"),
                           overwrite = TRUE)
#> Warning:
#> 3 (10.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
#> Automatically saving the model object in '/Users/runner/work/sr2-solutions/sr2-solutions/fits/chp7/bh72-spln.rds'

loo_compare(laf_line2, laf_quad2, laf_spln2, criterion = "waic")
#>           elpd_diff se_diff
#> laf_spln2  0.0       0.0
#> laf_quad2 -0.3       0.9
#> laf_line2 -0.7       1.1

loo_compare(laf_line2, laf_quad2, laf_spln2, criterion = "loo")
#>           elpd_diff se_diff
#> laf_spln2  0.0       0.0
#> laf_quad2 -0.3       0.9
#> laf_line2 -0.8       1.2
```

:::{.question .code-question}
> **7H3.** Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:
>
> |          | Species A | Species B | Species C | Species D | Species E |
> |:---------|----------:|----------:|----------:|----------:|----------:|
> | Island 1 | .200      | .200      | .200      | .200      | .200      |
> | Island 2 | .800      | .100      | .050      | .025      | .025      |
> | Island 3 | .050      | .150      | .700      | .050      | .050      |
>
> Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
:::

First, let’s compute the entropy for each island.

```
islands <- tibble(island = paste("Island", 1:3),
                  a = c(0.2, 0.8, 0.05),
                  b = c(0.2, 0.1, 0.15),
                  c = c(0.2, 0.05, 0.7),
                  d = c(0.2, 0.025, 0.05),
                  e = c(0.2, 0.025, 0.05)) %>%
  pivot_longer(-island, names_to = "species", values_to = "prop")

islands %>%
  group_by(island) %>%
  summarize(prop = list(prop), .groups = "drop") %>%
  mutate(entropy = map_dbl(prop, calc_entropy))
#> # A tibble: 3 x 3
#>   island   prop      entropy
#>   <chr>    <list>      <dbl>
#> 1 Island 1 <dbl [5]>   1.61
#> 2 Island 2 <dbl [5]>   0.743
#> 3 Island 3 <dbl [5]>   0.984
```

The first island has the highest entropy. This is expected, because it has the most even distribution of bird species. All species are equally likely, so the observation of any one species is not surprising. In contrast, Island 2 has the lowest entropy. This is because the vast majority of birds on this island are Species A. Therefore, observing a bird that is not from Species A would be surprising.

For the second part of the question, we need to compute the KL divergence for each pair of islands. The KL divergence is defined as:

\[ D_{KL} = \sum_i p_i(\log(p_i) - \log(q_i)) \]

We’ll write a function to do this calculation.
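Following the formula directly (and assuming no zero probabilities in `q`, which holds for these islands), the function is:

```
# Kullback-Leibler divergence: the average difference in log probability
# between the target distribution p and the model distribution q
d_kl <- function(p, q) {
  sum(p * (log(p) - log(q)))
}
```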

Now, let’s calculate \(D_{KL}\) for each set of islands.

```
crossing(model = paste("Island", 1:3),
         predicts = paste("Island", 1:3)) %>%
  filter(model != predicts) %>%
  left_join(islands, by = c("model" = "island")) %>%
  rename(model_prop = prop) %>%
  left_join(islands, by = c("predicts" = "island", "species")) %>%
  rename(predict_prop = prop) %>%
  group_by(model, predicts) %>%
  summarize(q = list(model_prop),
            p = list(predict_prop),
            .groups = "drop") %>%
  mutate(kl_distance = map2_dbl(p, q, d_kl))
#> # A tibble: 6 x 5
#>   model    predicts q         p         kl_distance
#>   <chr>    <chr>    <list>    <list>          <dbl>
#> 1 Island 1 Island 2 <dbl [5]> <dbl [5]>       0.866
#> 2 Island 1 Island 3 <dbl [5]> <dbl [5]>       0.626
#> 3 Island 2 Island 1 <dbl [5]> <dbl [5]>       0.970
#> 4 Island 2 Island 3 <dbl [5]> <dbl [5]>       1.84
#> 5 Island 3 Island 1 <dbl [5]> <dbl [5]>       0.639
#> 6 Island 3 Island 2 <dbl [5]> <dbl [5]>       2.01
```

These results show us that when using Island 1 to predict Island 2, the KL divergence is about 0.87. When we use Island 1 to predict Island 3, the KL divergence is about 0.63, and so on. Overall, the distances are shorter when we use Island 1 as the model. This is because Island 1 has the highest entropy. Thus, we are less surprised by the other islands, so there’s a shorter distance. In contrast, Island 2 and Island 3 have very concentrated distributions, so predicting the other islands leads to more surprises, and therefore greater distances.

**7H4.** Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models `m6.9` and `m6.10` again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

As a reminder, here is the DAG for this example, where \(H\) is happiness, \(M\) is marriage, and \(A\) is age.

```
library(dagitty)
library(ggdag)

hma_dag <- dagitty("dag{H -> M <- A}")
coordinates(hma_dag) <- list(x = c(H = 1, M = 2, A = 3),
                             y = c(H = 1, M = 1, A = 1))

ggplot(hma_dag, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_text(color = "black", size = 10) +
  geom_dag_edges(edge_color = "black", edge_width = 2,
                 arrow_directed = grid::arrow(length = grid::unit(15, "pt"),
                                              type = "closed")) +
  theme_void()
```

First, let’s regenerate the data and estimate the models.

```
d <- sim_happiness(seed = 1977, N_years = 1000)

dat <- d %>%
  filter(age > 17) %>%
  mutate(a = (age - 18) / (65 - 18),
         mid = factor(married + 1, labels = c("single", "married")))

b6.9 <- brm(happiness ~ 0 + mid + a, data = dat, family = gaussian,
            prior = c(prior(normal(0, 1), class = b, coef = midmarried),
                      prior(normal(0, 1), class = b, coef = midsingle),
                      prior(normal(0, 2), class = b, coef = a),
                      prior(exponential(1), class = sigma)),
            iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
            file = here("fits", "chp7", "b7h4-6.9"))

b6.10 <- brm(happiness ~ 1 + a, data = dat, family = gaussian,
             prior = c(prior(normal(0, 1), class = Intercept),
                       prior(normal(0, 2), class = b, coef = a),
                       prior(exponential(1), class = sigma)),
             iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
             file = here("fits", "chp7", "b7h4-6.10"))
```

For the model comparison we’ll use PSIS.

```
b6.9 <- add_criterion(b6.9, criterion = "loo")
b6.10 <- add_criterion(b6.10, criterion = "loo")

loo_compare(b6.9, b6.10)
#>       elpd_diff se_diff
#> b6.9     0.0       0.0
#> b6.10 -194.0      17.6
```

PSIS shows a strong preference for `b6.9`, which is the model that includes both age and marriage status. However, `b6.10` provides the correct causal inference, as no additional conditioning is needed.

The reason is that, in this model, marital status is a collider. Adding this variable to the model adds a real statistical association between happiness and age, which improves the predictions that are made. However, the association is not causal, so intervening on age (if that were possible) would not actually change happiness. Therefore, it’s important to consider the causal implications of your model before selecting one based on PSIS or WAIC alone.

**7H5.** Revisit the urban fox data, `data(foxes)`, from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using `weight` as the outcome, and containing these sets of predictor variables:

1. `avgfood + groupsize + area`
2. `avgfood + groupsize`
3. `groupsize + area`
4. `avgfood`
5. `area`

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (`dSE`).

First, let’s estimate the five models.

```
data(foxes)

fox_dat <- foxes %>%
  as_tibble() %>%
  select(area, avgfood, weight, groupsize) %>%
  mutate(across(everything(), standardize))

b7h5_1 <- brm(weight ~ 1 + avgfood + groupsize + area, data = fox_dat,
              family = gaussian,
              prior = c(prior(normal(0, 0.2), class = Intercept),
                        prior(normal(0, 0.5), class = b),
                        prior(exponential(1), class = sigma)),
              iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
              file = here("fits", "chp7", "b7h5_1"))

b7h5_2 <- brm(weight ~ 1 + avgfood + groupsize, data = fox_dat,
              family = gaussian,
              prior = c(prior(normal(0, 0.2), class = Intercept),
                        prior(normal(0, 0.5), class = b),
                        prior(exponential(1), class = sigma)),
              iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
              file = here("fits", "chp7", "b7h5_2"))

b7h5_3 <- brm(weight ~ 1 + groupsize + area, data = fox_dat,
              family = gaussian,
              prior = c(prior(normal(0, 0.2), class = Intercept),
                        prior(normal(0, 0.5), class = b),
                        prior(exponential(1), class = sigma)),
              iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
              file = here("fits", "chp7", "b7h5_3"))

b7h5_4 <- brm(weight ~ 1 + avgfood, data = fox_dat,
              family = gaussian,
              prior = c(prior(normal(0, 0.2), class = Intercept),
                        prior(normal(0, 0.5), class = b),
                        prior(exponential(1), class = sigma)),
              iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
              file = here("fits", "chp7", "b7h5_4"))

b7h5_5 <- brm(weight ~ 1 + area, data = fox_dat,
              family = gaussian,
              prior = c(prior(normal(0, 0.2), class = Intercept),
                        prior(normal(0, 0.5), class = b),
                        prior(exponential(1), class = sigma)),
              iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
              file = here("fits", "chp7", "b7h5_5"))
```

Then we can calculate the WAIC for each model and the model comparisons.

```
b7h5_1 <- add_criterion(b7h5_1, criterion = "waic")
b7h5_2 <- add_criterion(b7h5_2, criterion = "waic")
b7h5_3 <- add_criterion(b7h5_3, criterion = "waic")
b7h5_4 <- add_criterion(b7h5_4, criterion = "waic")
b7h5_5 <- add_criterion(b7h5_5, criterion = "waic")

comp <- loo_compare(b7h5_1, b7h5_2, b7h5_3, b7h5_4, b7h5_5, criterion = "waic")
comp
#>        elpd_diff se_diff
#> b7h5_1  0.0       0.0
#> b7h5_3 -0.4       1.4
#> b7h5_2 -0.4       1.7
#> b7h5_4 -5.2       3.4
#> b7h5_5 -5.4       3.4
```

Overall, the WAIC values are very similar, and all of the differences fall well within the 99% interval of each difference.

However, there do seem to be two groups of models: `b7h5_1`, `b7h5_2`, and `b7h5_3` are all nearly identical, while `b7h5_4` and `b7h5_5` are nearly identical to each other. To understand why this is, we can return to the DAG for this example.
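As a reminder, here is a sketch of that DAG using {dagitty}, assuming the structure from the Chapter 6 practice problems, where `area` influences `weight` only through `avgfood`:

```
# area -> avgfood -> groupsize -> weight, with avgfood -> weight
fox_dag <- dagitty("dag{area -> avgfood
                        avgfood -> groupsize
                        avgfood -> weight
                        groupsize -> weight}")
```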

The first three models (`b7h5_1`, `b7h5_2`, and `b7h5_3`) all contain `groupsize` and one or both of `area` and `avgfood`. The reason these models perform the same is that there are no back-door paths from `area` or `avgfood` to `weight`. In other words, the effect of `area` adjusting for `groupsize` is the same as the effect of `avgfood` adjusting for `groupsize`, because the effect of `area` is routed entirely through `avgfood`.

Similarly, the last two models (`b7h5_4`

and `b7h5_5`

) are also nearly identical because of the relationship of `area`

to `avgfood`

. Because the effect of `area`

is routed entirely through `avgfood`

, including only `avgfood`

or `area`

should result in the same inferences.

## Session Info

View the session information used to render this week.

```
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os macOS Catalina 10.15.6
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz UTC
#> date 2020-09-19
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> ! package * version date lib
#> P abind 1.4-5 2016-07-21 [?]
#> P arrayhelpers 1.1-0 2020-02-04 [?]
#> P assertthat 0.2.1 2019-03-21 [?]
#> P backports 1.1.9 2020-08-24 [?]
#> P base64enc 0.1-3 2015-07-28 [?]
#> P bayesplot 1.7.2 2020-05-28 [?]
#> P blob 1.2.1 2020-01-20 [?]
#> P bookdown 0.20.3 2020-09-19 [?]
#> P boot 1.3-25 2020-04-26 [?]
#> P bridgesampling 1.0-0 2020-02-26 [?]
#> P brms * 2.13.9 2020-09-19 [?]
#> P Brobdingnag 1.2-6 2018-08-13 [?]
#> P broom 0.7.0 2020-07-09 [?]
#> P callr 3.4.3 2020-03-28 [?]
#> P cellranger 1.1.0 2016-07-27 [?]
#> P checkmate 2.0.0 2020-02-06 [?]
#> P cli 2.0.2 2020-02-28 [?]
#> P coda 0.19-3 2019-07-05 [?]
#> P codetools 0.2-16 2018-12-24 [?]
#> P colorspace 1.4-1 2019-03-18 [?]
#> P colourpicker 1.0 2017-09-27 [?]
#> P cpp11 0.2.1 2020-08-11 [?]
#> P crayon 1.3.4 2017-09-16 [?]
#> P crosstalk 1.1.0.1 2020-03-13 [?]
#> P curl 4.3 2019-12-02 [?]
#> P dagitty * 0.3-0 2020-07-21 [?]
#> P DBI 1.1.0 2019-12-15 [?]
#> P dbplyr 1.4.4 2020-05-27 [?]
#> P desc 1.2.0 2018-05-01 [?]
#> P devtools 2.3.1.9000 2020-09-19 [?]
#> P digest 0.6.25 2020-02-23 [?]
#> P distributional 0.2.0 2020-08-03 [?]
#> P dplyr * 1.0.2 2020-08-18 [?]
#> P DT 0.15 2020-08-05 [?]
#> P dygraphs 1.1.1.6 2018-07-11 [?]
#> P ellipsis 0.3.1 2020-05-15 [?]
#> P emo 0.0.0.9000 2020-09-19 [?]
#> P evaluate 0.14 2019-05-28 [?]
#> P extrafont 0.17 2014-12-08 [?]
#> P extrafontdb 1.0 2012-06-11 [?]
#> P fansi 0.4.1 2020-01-08 [?]
#> P farver 2.0.3 2020-01-16 [?]
#> P fastmap 1.0.1 2019-10-08 [?]
#> P forcats * 0.5.0 2020-03-01 [?]
#> P fs 1.5.0 2020-07-31 [?]
#> P gdtools 0.2.2 2020-04-03 [?]
#> P generics 0.0.2 2018-11-29 [?]
#> P ggdag * 0.2.2 2020-02-13 [?]
#> P ggdist 2.2.0 2020-07-12 [?]
#> P ggforce 0.3.2 2020-06-23 [?]
#> P gghighlight * 0.3.0 2020-03-29 [?]
#> P ggplot2 * 3.3.2 2020-06-19 [?]
#> P ggraph * 2.0.3 2020-05-20 [?]
#> P ggrepel 0.8.2 2020-03-08 [?]
#> P ggridges 0.5.2 2020-01-12 [?]
#> P glue * 1.4.2 2020-08-27 [?]
#> P graphlayouts 0.7.0 2020-04-25 [?]
#> P gridExtra 2.3 2017-09-09 [?]
#> P gtable 0.3.0 2019-03-25 [?]
#> P gtools 3.8.2 2020-03-31 [?]
#> P haven 2.3.1 2020-06-01 [?]
#> P here * 0.1 2017-05-28 [?]
#> P highr 0.8 2019-03-20 [?]
#> P hms 0.5.3 2020-01-08 [?]
#> P hrbrthemes * 0.8.0 2020-03-06 [?]
#> P htmltools 0.5.0 2020-06-16 [?]
#> P htmlwidgets 1.5.1 2019-10-08 [?]
#> P httpuv 1.5.4 2020-06-06 [?]
#> P httr 1.4.2 2020-07-20 [?]
#> P igraph 1.2.5 2020-03-19 [?]
#> P inline 0.3.15 2018-05-18 [?]
#> P jsonlite 1.7.0 2020-06-25 [?]
#> P kableExtra * 1.2.1 2020-08-27 [?]
#> P knitr 1.29 2020-06-23 [?]
#> P labeling 0.3 2014-08-23 [?]
#> P later 1.1.0.1 2020-06-05 [?]
#> P lattice 0.20-41 2020-04-02 [?]
#> P lifecycle 0.2.0 2020-03-06 [?]
#> P loo * 2.3.1 2020-07-14 [?]
#> P lubridate 1.7.9 2020-06-08 [?]
#> P magrittr 1.5 2014-11-22 [?]
#> P markdown 1.1 2019-08-07 [?]
#> P MASS 7.3-51.6 2020-04-26 [?]
#> P Matrix 1.2-18 2019-11-27 [?]
#> P matrixStats 0.56.0 2020-03-13 [?]
#> P memoise 1.1.0 2017-04-21 [?]
#> P mgcv 1.8-31 2019-11-09 [?]
#> P mime 0.9 2020-02-04 [?]
#> P miniUI 0.1.1.1 2018-05-18 [?]
#> P modelr 0.1.8 2020-05-19 [?]
#> P munsell 0.5.0 2018-06-12 [?]
#> P mvtnorm 1.1-1 2020-06-09 [?]
#> P nlme 3.1-148 2020-05-24 [?]
#> P pillar 1.4.6 2020-07-10 [?]
#> P pkgbuild 1.1.0 2020-07-13 [?]
#> P pkgconfig 2.0.3 2019-09-22 [?]
#> P pkgload 1.1.0 2020-05-29 [?]
#> P plyr 1.8.6 2020-03-03 [?]
#> P polyclip 1.10-0 2019-03-14 [?]
#> P prettyunits 1.1.1 2020-01-24 [?]
#> P processx 3.4.3 2020-07-05 [?]
#> P promises 1.1.1 2020-06-09 [?]
#> P ps 1.3.4 2020-08-11 [?]
#> P purrr * 0.3.4 2020-04-17 [?]
#> P R6 2.4.1 2019-11-12 [?]
#> P ragg 0.3.1 2020-07-03 [?]
#> P ratlas * 0.0.0.9000 2020-09-19 [?]
#> P RColorBrewer 1.1-2 2014-12-07 [?]
#> P Rcpp * 1.0.5 2020-07-06 [?]
#> P RcppParallel 5.0.2 2020-06-24 [?]
#> P readr * 1.3.1 2018-12-21 [?]
#> P readxl 1.3.1 2019-03-13 [?]
#> P remotes 2.2.0 2020-07-21 [?]
#> renv 0.11.0 2020-06-26 [1]
#> P reprex 0.3.0 2019-05-16 [?]
#> P reshape2 1.4.4 2020-04-09 [?]
#> P rethinking * 2.13 2020-09-19 [?]
#> P rlang 0.4.7 2020-07-09 [?]
#> P rmarkdown 2.3.3 2020-09-19 [?]
#> P rprojroot 1.3-2 2018-01-03 [?]
#> P rsconnect 0.8.16 2019-12-13 [?]
#> P rstan * 2.21.2 2020-07-27 [?]
#> P rstantools 2.1.1 2020-07-06 [?]
#> P rstudioapi 0.11 2020-02-07 [?]
#> P Rttf2pt1 1.3.8 2020-01-10 [?]
#> P rvest 0.3.6 2020-07-25 [?]
#> P scales 1.1.1 2020-05-11 [?]
#> P sessioninfo 1.1.1 2018-11-05 [?]
#> P shape 1.4.4 2018-02-07 [?]
#> P shiny 1.5.0 2020-06-23 [?]
#> P shinyjs 1.1 2020-01-13 [?]
#> P shinystan 2.5.0 2018-05-01 [?]
#> P shinythemes 1.1.2 2018-11-06 [?]
#> P StanHeaders * 2.21.0-6 2020-08-16 [?]
#> P stringi 1.5.3 2020-09-09 [?]
#> P stringr * 1.4.0 2019-02-10 [?]
#> P svUnit 1.0.3 2020-04-20 [?]
#> P systemfonts 0.3.0 2020-09-01 [?]
#> P testthat 2.99.0.9000 2020-09-19 [?]
#> P threejs 0.3.3 2020-01-21 [?]
#> P tibble * 3.0.3 2020-07-10 [?]
#> P tidybayes * 2.1.1.9000 2020-09-19 [?]
#> P tidybayes.rethinking * 2.0.3.9000 2020-09-19 [?]
#> P tidygraph * 1.2.0 2020-05-12 [?]
#> P tidyr * 1.1.2 2020-08-27 [?]
#> P tidyselect 1.1.0 2020-05-11 [?]
#> P tidyverse * 1.3.0 2019-11-21 [?]
#> P tweenr 1.0.1 2018-12-14 [?]
#> P usethis 1.6.1 2020-04-29 [?]
#> P utf8 1.1.4 2018-05-24 [?]
#> P V8 3.2.0 2020-06-19 [?]
#> P vctrs 0.3.4 2020-08-29 [?]
#> P viridis 0.5.1 2018-03-29 [?]
#> P viridisLite 0.3.0 2018-02-01 [?]
#> P webshot 0.5.2 2019-11-22 [?]
#> P withr 2.2.0 2020-04-20 [?]
#> P xfun 0.16 2020-07-24 [?]
#> P xml2 1.3.2 2020-04-23 [?]
#> P xtable 1.8-4 2019-04-21 [?]
#> P xts 0.12-0 2020-01-19 [?]
#> P yaml 2.2.1 2020-02-01 [?]
#> P zoo 1.8-8 2020-05-02 [?]
#> source
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> Github (rstudio/bookdown@9eb20a8)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> Github (paul-buerkner/brms@c520a8c)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.1)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> Github (r-lib/devtools@df619ce)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> Github (hadley/emo@3f03b11)
#> CRAN (R 4.0.1)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> Github (atlas-aai/ratlas@746bf9c)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> Github (rmcelreath/rethinking@3b48ec8)
#> CRAN (R 4.0.2)
#> Github (rstudio/rmarkdown@204aa41)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> Github (r-lib/testthat@b3f2203)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> Github (mjskay/tidybayes@17f4dc6)
#> Github (mjskay/tidybayes.rethinking@df903c8)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.1)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#> CRAN (R 4.0.2)
#>
#> [1] /Users/runner/work/sr2-solutions/sr2-solutions/renv/library/R-4.0/x86_64-apple-darwin17.0
#> [2] /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpBlmf7r/renv-system-library
#>
#> P ── Loaded and on-disk path mismatch.
```