17 Multicollinearity

In this assignment, we’ll explore another central idea from Econometrics: multicollinearity, and how it affects the financial models we’ve been studying.

Multicollinearity occurs when two or more explanatory variables are highly correlated with each other. When this happens, their individual effects on the dependent variable become very hard to disentangle. OLS estimates remain unbiased, but their standard errors can become large.

We’ll start with a simulation of multicollinearity, then work through a mathematical argument, and finally explore how correlated factors in the Fama-French factor models make it difficult to precisely estimate the coefficient on each factor.

Part 1: map() Simulation

Question 1. Simulate a data set of 100 observations. Let x1 and x2 be independent standard normal random variables, and let y depend on both with its own noise term. Then measure the correlation between x1 and x2.

library(tidyverse)

# Simulate with independent x1 and x2, then measure their correlation.
# If you run the code over and over, you should see that because x1 and x2
# are independent, their correlation is small and has no tendency
# to be positive or negative.

tibble(
  x1 = ___(n = 100, mean = 0, sd = 1),
  x2 = ___(n = 100, mean = 0, sd = 1),
  y  = 5 + 3 * x1 + 2 * x2 + ___(n = 100, mean = 0, sd = 1)
) %>%
  summarize(correlation = cor(x1, x2))

Correlation measures how closely two variables move together, on a scale from -1 to 1. A correlation of 0 means the variables are uncorrelated: there is no linear relationship between them, and a scatter plot of the two typically looks like a shapeless cloud of points. A correlation of 1 means perfectly positively correlated: as one increases, the other increases proportionally, and the scatter plot would show a perfect upward-sloping line. A correlation of -1 means perfectly negatively correlated: as one increases, the other decreases proportionally, forming a perfect downward-sloping line. Since x1 and x2 are drawn independently here, we expect their correlation to be close to 0.
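To make these three benchmark cases concrete, here is a small illustrative sketch. The names a, b_pos, b_neg, and b_ind and the seed are our own choices and are not part of the assignment. It builds one variable that is an exact increasing linear function of a, one that is an exact decreasing linear function of a, and one drawn independently of a.

```r
set.seed(2024)          # arbitrary seed, used only for reproducibility

a     <- rnorm(100)     # baseline variable
b_pos <- 2 * a + 1      # exact increasing linear function of a
b_neg <- -3 * a         # exact decreasing linear function of a
b_ind <- rnorm(100)     # drawn independently of a

cor(a, b_pos)           # equals 1, up to floating-point rounding
cor(a, b_neg)           # equals -1, up to floating-point rounding
cor(a, b_ind)           # some small number near 0; its sign varies with the seed
```

Note that the slope and intercept of the linear function do not matter for the first two cases: only the direction of the line determines whether the correlation is 1 or -1.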

Question 2. What are the true effects of x1 and x2 on y? That is, what are \(\beta_1\) and \(\beta_2\)?

Question 3. Use map() to do this 100 times: generate data, run the regression y ~ x1 + x2, and get the coefficient \(\hat{\beta}_2\) and its p-value. Then plot the distribution of \(\hat{\beta}_2\) and count how often it is statistically significant out of the 100 simulations. Is the estimate centered on the true value?

# 100 repeated simulations, using independent x1 and x2
map(
  .x = 1:100,
  .f = function(a) {
    tibble(
      x1 = ___,
      x2 = ___,
      y  = ___
    ) %>%
      lm(y ~ x1 + x2, data = .) %>%
      broom::tidy() %>%
      slice(3)
  }
) %>%
  bind_rows() %>%
  {
    # Plot the distribution
    print(
      ggplot(., aes(x = estimate)) +
        geom_histogram() +
        geom_vline(xintercept = ___)
    )
    
    # Count the number of times beta2 is statistically significant
    count(., significant = p.value < ___)
  }
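If the map() and bind_rows() pattern above is unfamiliar, the following miniature version may help. It is only a sketch with made-up names (toy, tidy_fit, runs) and arbitrary coefficients that differ from the assignment's. The first part shows the row order that broom::tidy() returns for a two-regressor model, which is why slice(3) picks out the x2 row. The second part shows map() returning a list of one-row tibbles that bind_rows() stacks into a single tibble.

```r
library(tidyverse)

set.seed(42)  # arbitrary seed

# Part A: the order of rows returned by broom::tidy()
toy <- tibble(
  x1 = rnorm(50),
  x2 = rnorm(50),
  y  = 1 + x1 + x2 + rnorm(50)   # illustrative coefficients only
)

tidy_fit <- lm(y ~ x1 + x2, data = toy) %>%
  broom::tidy()

tidy_fit$term   # "(Intercept)" "x1" "x2": row 3 is the x2 coefficient

# Part B: map() builds a list of small tibbles, bind_rows() stacks them
runs <- map(
  .x = 1:5,
  .f = function(a) {
    tibble(run = a, avg = mean(rnorm(20)))
  }
) %>%
  bind_rows()

runs            # one row per simulation run
```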

Question 4. Now introduce multicollinearity. Instead of drawing x2 independently, set x2 = x1 + rnorm(n = 100, mean = 0, sd = 0.2). This means x2 is no longer independent of x1: the two will be highly correlated, because x2 is nearly identical to x1 with only a small amount of extra noise added. First measure the correlation between them, then repeat the simulation 100 times and compare the distribution of \(\hat{\beta}_2\) to what you found in Question 3.

set.seed(1234)

# x2 is now nearly identical to x1
tibble(
  x1 = ______,
  x2 = x1 + rnorm(n = 100, mean = 0, sd = 0.2),
  y  = 5 + 3 * x1 + 2 * x2 + rnorm(n = 100, mean = 0, sd = 1)
) %>%
  summarize(correlation = ____)

# 100 repeated simulations, highly correlated x1 and x2
map(
  .x = 1:100,
  .f = function(a) {
    tibble(
      x1 = ____,
      x2 = x1 + rnorm(n = 100, mean = 0, sd = ___),
      y  = 5 + 3 * x1 + 2 * x2 + rnorm(n = 100, mean = 0, sd = 1)
    ) %>%
      lm(y ~ x1 + x2, data = .) %>%
      broom::tidy() %>%
      slice(___)
  }
) %>%
  bind_rows() %>% {
    print(
      ggplot(., aes(x = ____)) +
        geom_histogram() +
        geom_vline(xintercept = ___)
    )
    
    count(., significant = p.value < ___)
  }

# Now change x2 from using sd = 0.2 to 0.1, 0.05, and 0.025.
# Fill in this table with approximations:
# --------------------------------------------------------------
# | sd    | min(estimate) | max(estimate) | number significant |
# --------------------------------------------------------------
# | 0.2   | 0.75          | 3.4           | 95                 |
# | 0.1   | ____          | ____          | __                 |
# | 0.05  | ____          | ____          | __                 |
# | 0.025 | ____          | ____          | __                 |

Question 5. Summarize: how does the degree of multicollinearity affect the distribution of OLS coefficients?

Question 6. In Question 1, the correlation between x1 and x2 should be close to zero. Why?

Because x1 and x2 were drawn from different distributions
Because x1 and x2 were drawn independently, so they share no common component
Because the sample size of 100 is too small to detect correlation
Because y was constructed to depend equally on both

Question 7. When you set x1 and x2 to be independent, were your estimates of \(\hat{\beta}_2\) precise, and were you often able to reject the null hypothesis that \(\beta_2 = 0\)?

The estimates were imprecise and we rarely rejected the null
The estimates were precise and centered near 2, and we often rejected the null
The estimates were precise but we rarely rejected the null
The estimates were imprecise but we often rejected the null

Question 8. When x2 became highly correlated with x1, which of the following best describes what changed?

The distribution of \(\hat{\beta}_2\) shifted away from 2
The distribution of \(\hat{\beta}_2\) stayed centered near 2 but became much wider
The distribution of \(\hat{\beta}_2\) became narrower and more precise
The estimate of \(\hat{\beta}_1\) became biased toward zero

Question 9. When x2 = x1 + rnorm(n, 0, 0.2), the correlation between x1 and x2 is close to 1. Why does high correlation make \(\hat{\beta}_2\) hard to estimate, even though it remains centered on the true value?

High correlation causes OLS to average the two coefficients together
OLS cannot separate the effects of two nearly identical variables, so small data fluctuations produce large swings in the estimates
High correlation biases the estimate toward zero
OLS drops one variable automatically when correlation exceeds 0.9

Part 2: Econometric Proof

Multicollinearity does not bias \(\hat{\beta}_2\) at all, but it makes \(\hat{\beta}_2\) extremely imprecise. Here’s why.

Start with a simple case: perfect multicollinearity. Suppose \(X_2 = X_1\) exactly. Then the model \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\) becomes:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 + \varepsilon = \beta_0 + (\beta_1 + \beta_2) X_1 + \varepsilon\]

OLS can estimate \(\beta_1 + \beta_2\) just fine, but it has no way to split that total between \(\beta_1\) and \(\beta_2\). Any pair where \(\beta_1 + \beta_2\) equals the right number fits the data equally well. The individual coefficients are completely unidentifiable.
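The perfect-collinearity case can be seen directly in R. The sketch below uses arbitrary illustrative coefficients (2 and 3, not the assignment's values) and sets x2 exactly equal to x1. In this situation lm() cannot split the total effect, so it reports NA for the second coefficient and loads the whole sum onto x1. This only happens when the collinearity is exact; with merely high correlation, lm() still reports both coefficients.

```r
set.seed(3)   # arbitrary seed

x1 <- rnorm(100)
x2 <- x1                                  # perfect multicollinearity
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(100)    # illustrative: beta1 + beta2 = 5

fit <- lm(y ~ x1 + x2)

coef(fit)   # the x2 coefficient is NA; the x1 coefficient is close to 5
```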

Step 1: Now suppose \(X_2 = X_1 + u\), where \(u\) is small random noise, so \(X_1\) and \(X_2\) are nearly but not perfectly identical. The quantity \(u\) is the only signal OLS has to separate the two coefficients: it is the part of \(X_2\) that differs from \(X_1\). Intuitively, OLS estimates \(\beta_2\) by comparing how much \(Y\) moves with \(u\) to how much \(u\) itself varies. In expectation, the OLS estimate of \(\beta_2\) equals:

\[E[\hat{\beta}_2] = \frac{Cov(u, Y)}{Var(u)}\]

Step 2: As \(u\) gets smaller (i.e., as \(X_1\) and \(X_2\) become more similar), \(Var(u) \to 0\).

Step 3: Now think about the numerator. Because \(u\) is independent of \(X_1\) and of the regression error \(\varepsilon\), and because \(Y\) depends on \(u\) only through the \(\beta_2 X_2 = \beta_2(X_1 + u)\) term, it can be shown that \(Cov(u, Y) = \beta_2 \cdot Var(u)\). Plugging this in:

\[E[\hat{\beta}_2] = \frac{\beta_2 \cdot Var(u)}{Var(u)} = \beta_2\]
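For readers who want the step \(Cov(u, Y) = \beta_2 \cdot Var(u)\) spelled out, one way to see it is to substitute the model for \(Y\) into the covariance and use linearity. This sketch relies only on the assumptions already stated in Step 3, namely that \(u\) is independent of \(X_1\) and of \(\varepsilon\):

\[\begin{aligned}
Cov(u, Y) &= Cov\big(u,\ \beta_0 + \beta_1 X_1 + \beta_2 (X_1 + u) + \varepsilon\big) \\
&= (\beta_1 + \beta_2)\, Cov(u, X_1) + \beta_2\, Cov(u, u) + Cov(u, \varepsilon) \\
&= 0 + \beta_2 \cdot Var(u) + 0 = \beta_2 \cdot Var(u)
\end{aligned}\]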

So \(\hat{\beta}_2\) is unbiased: on average, it equals the true value. But consider what happens in any individual sample. The sample covariance between \(u\) and \(Y\) fluctuates around \(\beta_2 \cdot Var(u)\) with random noise. When \(Var(u)\) is tiny, even a small fluctuation in the numerator gets divided by a very small denominator, producing a huge swing in \(\hat{\beta}_2\). The estimate bounces around wildly from sample to sample, even though it is centered correctly on \(\beta_2\).
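The idea that the leftover variation in \(X_2\) is the only signal for \(\beta_2\) can be checked numerically. The sketch below is an illustration under our own assumptions: the coefficients, sample size, and noise level are arbitrary and are not the assignment's. It first removes from x2 everything that x1 explains, leaving a residual r that is close to (but not exactly) u, and then shows that the covariance-over-variance ratio built from r reproduces the x2 coefficient that lm() reports.

```r
set.seed(7)   # arbitrary seed

n  <- 200
x1 <- rnorm(n)
u  <- rnorm(n, sd = 0.3)               # the part of x2 that differs from x1
x2 <- x1 + u
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # illustrative coefficients only

# Variation in x2 that is not explained by x1 (the sample analogue of u)
r <- resid(lm(x2 ~ x1))

b2_manual <- cov(r, y) / var(r)                      # covariance over variance
b2_lm     <- unname(coef(lm(y ~ x1 + x2))["x2"])     # coefficient from lm()

all.equal(b2_manual, b2_lm)   # the two agree up to floating-point rounding
```

Because var(r) sits in the denominator, shrinking the noise in x2 shrinks the denominator, which is the mechanism behind the large swings described above.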

Question 10. As \(x_1\) and \(x_2\) become more and more similar to each other, what happens to \(Var(u)\)?

It becomes negative
It shrinks toward zero
It stays roughly unchanged
It gets larger

Question 11. The formula shows \(E[\hat{\beta}_2] = \beta_2\). What does this tell you?

The estimate is unbiased across repeated samples
The estimate becomes exact in large samples
The estimate always has low variance
The estimate is guaranteed to match the true value in each regression

Question 12. Even though \(\hat{\beta}_2\) is unbiased, it is imprecise when multicollinearity is high. What does “imprecise” mean in this context?

The regression software may fail to compute the coefficient
The estimate is systematically pushed toward zero
The estimate can bounce around substantially across samples
The estimate becomes negative even if the true coefficient is positive

Question 13. If \(Var(u) = 0.01\) (high multicollinearity) versus \(Var(u) = 1.0\) (low multicollinearity), which dataset gives more precise estimates of \(\hat{\beta}_2\)?

The two datasets should give similar precision because the estimator is unbiased
\(Var(u) = 1.0\), because more independent variation improves precision
Precision depends only on the sample size, not on \(Var(u)\)
\(Var(u) = 0.01\), because lower variance always reduces estimation noise

Part 3: Stock Market Investigation

CAPM says that a stock’s expected return should depend only on its exposure to the overall market:

\[r_a - r_f = \alpha + \beta (r_m - r_f)\]

But over time, financial economists discovered many patterns in stock returns that CAPM struggled to explain. Small-cap stocks tended to outperform large-cap stocks. Value stocks tended to outperform growth stocks. More profitable firms often earned higher returns. Firms that invested aggressively sometimes earned lower returns.

Researchers responded by creating more and more “factors” designed to capture these patterns. This explosion of proposed factors became known as the factor zoo.

One of the most influential extensions was the Fama-French 5-Factor model, which includes:

MKT: the overall market excess return
SMB (Small Minus Big): small stocks minus large stocks
HML (High Minus Low): value stocks minus growth stocks
RMW (Robust Minus Weak): profitable firms minus unprofitable firms
CMA (Conservative Minus Aggressive): firms that invest conservatively minus firms that invest aggressively

The problem is that many of these factors are economically related and tend to move together over time. During recessions, for example, small firms, distressed firms, and unprofitable firms may all perform poorly at the same time. When factors move together, regressions struggle to separate their individual effects. This is exactly the multicollinearity problem we studied earlier.

In this section, we’ll investigate whether the factors are correlated with each other, whether those correlations line up with major historical events in the stock market, and whether adding more factors increases standard errors of regression coefficients.

Question 14. Answer this question before looking at the data. During the 2008 financial crisis, many risky stocks performed poorly at the same time. Suppose:

small firms were hit harder than large firms,
value firms struggled relative to growth firms,
unprofitable firms performed especially badly,
and firms making aggressive investments suffered large losses.

Based on this story, would you expect the following pairs of factors to be positively correlated, negatively correlated, or roughly uncorrelated during crisis periods?

mkt_rf and SMB
mkt_rf and HML
HML and RMW
SMB and CMA

Question 15. Compute the correlation matrix for the five Fama-French factors. Which pairs appear most strongly correlated? What do these correlations say about the market overall?

french %>%
  select(mkt_rf, SMB, HML, RMW, CMA) %>%
  cor()

Question 16. Instead of averaging across all time periods, let’s focus on a few major historical events when financial markets experienced unusually large shocks.

For each of the following periods, compute the correlations between the Fama-French factors:

The Oil Crisis and Stagflation (1973-1975)
The Dot-Com Crash (2000-2002)
The Global Financial Crisis (2007-2009)
The COVID Crash (2020)

What’s the largest positive correlation? What’s the most negative correlation? Explain the intuition for why these correlations might have been so strong.

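# Correlations during major historical events
# ymd() comes from lubridate, which library(tidyverse) attaches in
# tidyverse 2.0.0 and later; otherwise run library(lubridate) first.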
bind_rows(

  french %>%
    filter(date >= ymd("1973-01-01"),
           date <= ymd("1975-12-31")) %>%
    summarize(
      cor_mkt_smb = cor(mkt_rf, SMB),
      cor_mkt_hml = cor(mkt_rf, HML),
      cor_hml_rmw = cor(HML, RMW),
      cor_smb_cma = cor(SMB, CMA)
    ) %>%
    mutate(event = "Oil Crisis"),

  french %>%
    filter(date >= ymd("2000-01-01"),
           date <= ymd("2002-12-31")) %>%
    summarize(
      cor_mkt_smb = cor(mkt_rf, SMB),
      cor_mkt_hml = cor(mkt_rf, HML),
      cor_hml_rmw = cor(HML, RMW),
      cor_smb_cma = cor(SMB, CMA)
    ) %>%
    mutate(event = "Dot-Com Crash"),

  french %>%
    filter(date >= ymd("2007-01-01"),
           date <= ymd("2009-12-31")) %>%
    summarize(
      cor_mkt_smb = cor(mkt_rf, SMB),
      cor_mkt_hml = cor(mkt_rf, HML),
      cor_hml_rmw = cor(HML, RMW),
      cor_smb_cma = cor(SMB, CMA)
    ) %>%
    mutate(event = "Financial Crisis"),

  french %>%
    filter(date >= ymd("2020-01-01"),
           date <= ymd("2020-12-31")) %>%
    summarize(
      cor_mkt_smb = cor(mkt_rf, SMB),
      cor_mkt_hml = cor(mkt_rf, HML),
      cor_hml_rmw = cor(HML, RMW),
      cor_smb_cma = cor(SMB, CMA)
    ) %>%
    mutate(event = "COVID Crash")

)
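The four blocks above differ only in their date range and label, so the same table could also be produced with a small helper function. The following is a sketch of that alternative, not the assignment's required approach. Because the real french data is not available here, it runs on french_demo, a hypothetical stand-in filled with random numbers, so the correlations it prints are meaningless. The names french_demo, events, and event_cors are our own.

```r
library(tidyverse)
library(lubridate)   # for ymd(); attached automatically by tidyverse >= 2.0.0

set.seed(99)         # arbitrary seed

# Hypothetical stand-in for `french`: monthly dates with random "factors"
dates <- seq(ymd("1970-01-01"), ymd("2021-12-01"), by = "month")
french_demo <- tibble(
  date   = dates,
  mkt_rf = rnorm(length(dates)),
  SMB    = rnorm(length(dates)),
  HML    = rnorm(length(dates)),
  RMW    = rnorm(length(dates)),
  CMA    = rnorm(length(dates))
)

# One row per event: label plus start and end dates
events <- tribble(
  ~label,             ~start,       ~end,
  "Oil Crisis",       "1973-01-01", "1975-12-31",
  "Dot-Com Crash",    "2000-01-01", "2002-12-31",
  "Financial Crisis", "2007-01-01", "2009-12-31",
  "COVID Crash",      "2020-01-01", "2020-12-31"
)

# Compute the four pairwise correlations within one date window
event_cors <- function(label, start, end) {
  french_demo %>%
    filter(date >= ymd(start), date <= ymd(end)) %>%
    summarize(
      cor_mkt_smb = cor(mkt_rf, SMB),
      cor_mkt_hml = cor(mkt_rf, HML),
      cor_hml_rmw = cor(HML, RMW),
      cor_smb_cma = cor(SMB, CMA)
    ) %>%
    mutate(event = label)
}

# pmap() passes each row's columns to event_cors() as named arguments
result <- pmap(events, event_cors) %>%
  bind_rows()

result
```

To use this on the real data, replace french_demo with french inside event_cors(). Adding another historical episode then only requires one more row in events.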

Question 17. Take Apple (permno 14593). Compare the Fama-French 3-factor model to the Fama-French 5-factor model. What happens to the coefficients and the standard errors on mkt_rf, SMB, and HML when you add 2 more factors? Does this make sense given what we now know about multicollinearity?

# Fama-French 3-factor model
joined_data %>%
  filter(permno == 14593) %>%
  lm(ret - RF ~ ___ + ___ + ___, data = .) %>%
  broom::tidy() %>%
  select(term, estimate, std.error)

# Fama-French 5-factor model
joined_data %>%
  filter(permno == 14593) %>%
  lm(ret - RF ~ ___ + ___ + ___ + ___ + ___, data = .) %>%
  broom::tidy() %>%
  select(term, estimate, std.error)
