---
title: "16 Measurement Error"
format:
  html:
    self-contained: TRUE
---

In this assignment, we'll explore one central idea from Econometrics: **measurement error**, and how it relates to the Financial models we've been studying.

Measurement Error says: **When an explanatory variable is measured with error, its effect on the dependent variable gets "attenuated", or at least partially washed out (becomes biased toward 0).**

We'll start with a simulation of measurement error, then do a mathematical proof, and then finally explore how noisy CAPM beta estimates can weaken Fama-MacBeth results and prevent us from getting an accurate measure of the risk premium.

## Part 1: map() Simulation

```{r, message = F}
library(tidyverse)

# Question 1: Simulate a data set with variables x and y.
# x should be random noise (100 observations, mean of 50, sd of 20)
# y should be 100 + 3 * x + rnorm(n = 100, mean = 0, sd = 10)

tibble(
  x = ______________,
  y = ______________
)

# Question 2: What is the true effect of x on y?

# Question 3: Take your data set from question 1 and run the regression `y ~ x`
# several times. You should see your estimates for beta 1 are pretty close 
# to the true value.

tibble(
  x = ______________,
  y = ______________
) %>%
  lm(______________, data = .) %>%
  broom::tidy()

# Question 4: Take your simulation from the previous question and add another
# variable: x_observed, which is x + rnorm(n = 100, mean = 0, sd = 10)
# That is, x_observed is x **with measurement error**.
# If x effects y, but x_observed is the only version of x we're able to
# observe as researchers, we estimate the model `y ~ x_observed` instead
# of the model we'd like to run, which is `y ~ x`. Estimate the model
# `y ~ x_observed` several times. Are your estimates for beta 1 still
# close to the true value?

tibble(
  x = ______________,
  x_observed = x + rnorm(n = 100, mean = 0, sd = 10),
  y = 100 + 3 * x + rnorm(n = 100, mean = 0, sd = 10)
) %>%
  lm(______________, data = .) %>%
  broom::tidy()

# Question 5: Use `map()` to run the simulation from question 4 1000 times,
# visualizing the distribution of beta 1 estimates. Add a vertical line for
# the true value for beta 1. Do you capture the measurement error bias?

map(
  .x = 1:1000,
  .f = function(s) {
    tibble(
      x = ______________,
      x_observed = ______________,
      y = ______________
    ) %>%
      lm(______________, data = .) %>%
      broom::tidy() %>%
      slice(2) %>%
      select(estimate)
  }
) %>%
  bind_rows() %>%
  ggplot(aes(x = estimate)) +
  geom_density() +
  geom_vline(xintercept = ____)

# Question 6: The more measurement error, the more bias there will be.
# How much measurement error do you need before the estimate for beta 1
# is centered on something less than 0.5?

map(
  .x = 1:1000,
  .f = function(s) {
    tibble(
      x = ______________,
      x_observed = ______________,
      y = ______________
    ) %>%
      lm(______________, data = .) %>%
      broom::tidy() %>%
      slice(2) %>%
      select(estimate)
  }
) %>%
  bind_rows() %>%
  ggplot(aes(x = estimate)) +
  geom_density() +
  geom_vline(xintercept = ____)
```

## Part 2: Econometric Proof

In Econometrics, you learned that the regression slope coefficient is: $\hat{\beta_1} = \frac{Cov(X, Y)}{Var(X)}$. Variances are always positive, but covariances can be positive or negative depending on how X and Y move together: if an increase in X is related to an increase in Y, they have a positive covariance. If an increase in X is related to a decrease in Y, they have a negative covariance.

Suppose that instead of observing $X$, we observe measurement error on X: $X^{ME} = X + u$, where $u$ is random noise. In that case, when we run the regression $Y ~ X^{ME}$, we get $\hat{\beta_1}^{ME} = \frac{Cov(X^{ME}, Y)}{Var(X^{ME})}$. Let's write a proof that measurement error **attenuates** the estimation of $\hat{\beta_1}$: that is, $\hat{\beta_1^{ME}}$ is closer to zero compared to $\hat{\beta_1}$.

Step 1: substitute $X^{ME} = X + u$.

$$\hat{\beta_1}^{ME} = \frac{Cov(X^{ME}, Y)}{Var(X^{ME})} = \frac{Cov(X + u, Y)}{Var(X + u)}$$

Step 2: Apply the covariance rule: $Cov(X + Z, Y) = Cov(X, Y) + Cov(Z, Y)$

$$\hat{\beta_1}^{ME} = \frac{Cov(X, Y) + Cov(u, Y)}{Var(X + u)}$$

Step 3: Recognize that as long as $u$ is random noise, it should not covary at all with Y. So in expectation, we should have that $Cov(u, Y) = 0$.

$$\hat{\beta_1}^{ME} = \frac{Cov(X, Y) + 0}{Var(X + u)} = \frac{Cov(X, Y)}{Var(X + u)}$$

Step 4: Recognize that adding $u$ to $X$ increases its variance:

$$Var(X + u) \geq Var(X)$$
Step 5: Compare $\hat{\beta_1}^{ME}$ to $\hat{\beta_1}$.

$$\hat{\beta_1}^{ME} = \frac{Cov(X, Y)}{Var(X + u)}$$

$$\hat{\beta_1} = \frac{Cov(X, Y)}{Var(X)}$$

**Question 7:** What does this imply about the relationship between $\hat{\beta_1}^{ME}$ and $\hat{\beta_1}$? Explain: what will happen to $\hat{\beta_1}$ under measurement error if $Cov(X, Y)$ is positive? What about when it's negative?

## Part 3: Stock Market Investigation

Recall that Fama-MacBeth regressions are a two-step procedure. First, we estimate each stock's CAPM beta, which measures the stock's risk relative to the market. Then, for each time period, we run the regression `ret - rf ~ beta` to estimate the risk premium $\lambda$, which captures how much compensation investors receive for holding riskier stocks during that period. Finally, we average these estimated $\lambda$ values over time to get an overall measure of the market risk premium: the average compensation investors earn for taking on additional risk.

**Knowing what you now know about measurement error, what will happen to our estimation of $\lambda$ when CAPM betas are measured with error?** In this section, we'll explore this question.

```{r, message = F}
# from last class, use stock_panel.csv
stock_panel <- read_csv("stock_panel.csv")

# we also need the risk-free rate and the S&P500 return for each month, 
# which we can get from Kenneth French's data:
library(frenchdata)

french <- download_french_data('Fama/French 3 Factors')$subsets$data[[1]] %>%
  rename(mkt_rf = `Mkt-RF`) %>%
  mutate(
    date = ym(date),
    across(c(mkt_rf, SMB, HML, RF), ~ .x / 100)
    )

# Join our stock_panel with french:
joined_data <- stock_panel %>%
  mutate(date = floor_date(date, unit = "month")) %>%
  left_join(french, join_by(date)) %>%
  select(permno, comnam, date, ret, mkt_rf, RF)

# Question 8: Take Apple (permno 14593). Previously, we've estimated Apple's
# CAPM beta to be 1.34 using 25 years of data. When we just take
# 1 year of data at a time, how noisy is Apple's CAPM beta?

map(
  .x = seq.Date(from = ymd(20000101), to = ymd(20240101), by = "years"),
  .f = function(t) {
    joined_data %>%
      filter(date >= t, date < t %m+% months(12), permno == ______) %>%
      lm(ret - RF ~ mkt_rf, data = .) %>%
      broom::tidy() %>%
      slice(2) %>%
      select(beta = estimate) %>%
      mutate(date = t)
  }
) %>%
  bind_rows() %>%
  ggplot(aes(x = date, y = beta)) +
  geom_line()

# Question 9: Next, estimate CAPM betas for each stock that has 300 observations
# (that is, the stock reported a return for each month in the data)
betas <- map(
  .x = joined_data %>% count(permno) %>% filter(n == 300) %>% pull(permno),
  .f = function(p) {
    joined_data %>%
      filter(permno == ____) %>%
      lm(ret - RF ~ mkt_rf, data = .) %>%
      broom::tidy() %>%
      slice(2) %>%
      select(beta = estimate) %>%
      mutate(permno = p)
  }
) %>%
  bind_rows()

# Question 10: How many regressions did we just run?


# Add betas to joined_data:
joined_data <- joined_data %>%
  inner_join(betas, join_by(permno))

# Question 11: Fama-MacBeth step 2: use `betas` to estimate lambda: risk premiums.
map(
  .x = joined_data %>% pull(date) %>% unique(),
  .f = function(t) {
    joined_data %>%
      filter(date == t) %>%
      lm(ret - RF ~ ____, data = .) %>%
      broom::tidy() %>%
      slice(2) %>%
      select(estimate)
  }
) %>%
  bind_rows() %>%
  summarize(
    risk_premium_estimate = mean(estimate), 
    t_statistic = mean(estimate) / (sd(estimate) / sqrt(n()))
    )
```

**Question 12:** According to CAPM, stocks with higher beta should earn higher average excess returns. Compare your estimated Fama-MacBeth risk premium from Question 11 to this prediction. Is the estimated risk premium positive? Is it statistically significant? Based on this assignment, explain how noisy beta estimates could attenuate the estimated reward for risk.