library(tidyverse)
set.seed(1234)
# Question 1: Simulate a dataset with a structural break halfway through.
# Period 1 (rows 1 to 100): y = 1 + 2 * x + noise
# Period 2 (rows 101 to 200): y = 1 + 4 * x + noise
data <- tibble(
x = rnorm(n = 100, mean = 0, sd = 1),
y = _____ + rnorm(n = 100, mean = 0, sd = 1)
) %>%
bind_rows(
tibble(
x = rnorm(n = 100, mean = 0, sd = 1),
y = _____ + rnorm(n = 100, mean = 0, sd = 1)
)
)
# Question 2: What is the true effect of x on y in rows 1-100? In rows 101-200?
# Question 3: Run ONE pooled regression `y ~ x` on the full dataset.
data %>%
lm(_____, data = .) %>%
broom::tidy() %>%
slice(2)
# Question 4: Now run SEPARATE regressions for rows 1-100 and rows 101-200.
data %>%
slice(1:100) %>%
lm(_____, data = .) %>%
broom::tidy() %>%
slice(2)
data %>%
slice(101:200) %>%
lm(_____, data = .) %>%
broom::tidy() %>%
slice(2)
18 Structural Breaks and Regime Changes
In this assignment, we’ll explore another central idea from econometrics: structural breaks, and how they affect the financial models we’ve been studying.
A structural break occurs when the relationship between variables changes at some point in time. A regression estimated over the whole sample then averages across the regimes, producing an estimate that is wrong in both periods.
We’ll start with a simulation of a structural break, then work through a simple proof, and finally explore how CAPM betas shift during major market crises such as the 2008 financial crisis and COVID-19.
Part 1: map() Simulation
Question 5: Fill in the blanks to synthesize what you learned in the last 4 questions:
- In rows 1-100, the true \(\beta_1\) was ____, and in rows 101-200, the true \(\beta_1\) was ____.
- When we ran one pooled regression, we estimated \(\beta_1\) to be ____ because _________________.
- When we ran separate regressions for rows 1-100 and rows 101-200, we (were/were not) able to recover approximately the true values for \(\beta_1\) in each regime.
Question 6: Suppose a researcher does not know that a structural break occurred and uses only the pooled regression. They report a single \(\hat{\beta}_1\) and use it to make predictions in both periods. In which period will their predictions be more wrong, period 1 or period 2?
Part 2: Econometric Proof
The simulation showed that a pooled regression averages across regimes. Here’s the math behind why.
Setup: Suppose, just like in the simulation, the data has 200 observations and a structural break occurs halfway through, after the first 100 observations:
\[y_i = \begin{cases} \beta_0 + \beta_1^A x_i + \varepsilon_i & \text{if } i \leq 100 \\ \beta_0 + \beta_1^B x_i + \varepsilon_i & \text{if } i > 100 \end{cases}\]
where \(\beta_1^A \neq \beta_1^B\). Instead of estimating the two regimes separately, suppose we run one pooled regression over the full sample. The pooled OLS slope estimate is:
\[\hat{\beta}_1 = \frac{Cov(X, Y)}{Var(X)}\]
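As a quick sanity check on the ratio form above, the sketch below compares \(Cov(X, Y) / Var(X)\) with the slope reported by `lm()` on a small simulated sample. The data and the names `slope_ratio` and `slope_lm` are illustrative assumptions, not part of the assignment. R’s `cov()` and `var()` divide by \(n - 1\) rather than \(N\), but the divisor cancels in the ratio.

```r
set.seed(1)

# Hypothetical toy data: y = 1 + 2 * x + noise
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

# Slope from the covariance/variance ratio
slope_ratio <- cov(x, y) / var(x)

# Slope from OLS
slope_lm <- unname(coef(lm(y ~ x))["x"])

# The two should agree up to floating-point error
all.equal(slope_ratio, slope_lm)
```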
Our goal is to show that: \[\hat{\beta}_1 = \frac{1}{2} \hat{\beta}_1^A + \frac{1}{2} \hat{\beta}_1^B\]
For simplicity, we’ll make two assumptions:
- Assumption 1 (equal group means): \(\bar{x}_A = \bar{x}_B\) and \(\bar{y}_A = \bar{y}_B\)
- Assumption 2 (equal group variances): \(Var(X) = Var(X_A) = Var(X_B)\)
\(Cov(X, Y) = \frac{\sum_{i = 1}^N (x_i - \bar{x}) (y_i - \bar{y})}{N}\)
\(\hat{\beta}_1 = \frac{Cov(X, Y)}{Var(X)} = \frac{\frac{1}{200} \sum_{i = 1}^{200} (x_i - \bar{x}) (y_i - \bar{y})}{Var(X)}\)
\(\hat{\beta}_1^A = \frac{Cov(X_A, Y_A)}{Var(X_A)} = \frac{\frac{1}{100} \sum_{i = 1}^{100} (x_i - \bar{x}_A) (y_i - \bar{y}_A)}{Var(X_A)}\)
\(\hat{\beta}_1^B = \frac{Cov(X_B, Y_B)}{Var(X_B)} = \frac{\frac{1}{100} \sum_{i = 101}^{200} (x_i - \bar{x}_B) (y_i - \bar{y}_B)}{Var(X_B)}\)
Proof:
\[\begin{align} \hat{\beta}_1 &= \frac{Cov(X,Y)}{Var(X)} \\ &= \frac{ \frac{1}{200}\sum_{i=1}^{200}(x_i-\bar{x})(y_i-\bar{y}) }{ Var(X) } \\ &= \frac{ \frac{1}{200}\sum_{i=1}^{100}(x_i-\bar{x})(y_i-\bar{y}) }{ Var(X) } + \frac{ \frac{1}{200}\sum_{i=101}^{200}(x_i-\bar{x})(y_i-\bar{y}) }{ Var(X) } \end{align}\]
By Assumption 1, \(\bar{x}_A = \bar{x}_B = \bar{x}\) and \(\bar{y}_A = \bar{y}_B = \bar{y}\), so we may replace the full-sample means with group-specific means. We also factor \(\frac{1}{200} = \frac{1}{2} \cdot \frac{1}{100}\) and replace \(Var(X)\) with the equal group variances \(Var(X_A)\) and \(Var(X_B)\):
\[\begin{align} \hat{\beta}_1 &= \frac{1}{2} \frac{ \frac{1}{100}\sum_{i=1}^{100}(x_i-\bar{x}_A)(y_i-\bar{y}_A) }{ Var(X_A) } + \frac{1}{2} \frac{ \frac{1}{100}\sum_{i=101}^{200}(x_i-\bar{x}_B)(y_i-\bar{y}_B) }{ Var(X_B) } \\ &= \frac{1}{2}\hat{\beta}_{1}^A + \frac{1}{2}\hat{\beta}_{1}^B. \end{align}\]
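The result can also be checked numerically. The sketch below is a minimal illustration, not the assignment’s simulation: it builds a hypothetical dataset in which both assumptions hold exactly, by reusing the same standardized `x` values in each half and centering the noise, and then compares the pooled slope with the simple average of the two group slopes. All object names here are assumptions made for the example.

```r
set.seed(42)
n <- 100

# Same standardized x in both halves, so group means and variances of x match exactly
x_half <- as.numeric(scale(rnorm(n)))

# Center the noise so both halves have the same mean of y (equal to the intercept)
e_a <- rnorm(n); e_a <- e_a - mean(e_a)
e_b <- rnorm(n); e_b <- e_b - mean(e_b)

sim <- data.frame(
  regime = rep(c("A", "B"), each = n),
  x = c(x_half, x_half),
  y = c(1 + 2 * x_half + e_a, 1 + 4 * x_half + e_b)
)

# Helper: OLS slope of y on x for a given data frame
slope <- function(d) unname(coef(lm(y ~ x, data = d))["x"])

beta_a <- slope(sim[sim$regime == "A", ])
beta_b <- slope(sim[sim$regime == "B", ])
beta_pooled <- slope(sim)

# Under the two assumptions, the pooled slope equals the simple average
all.equal(beta_pooled, 0.5 * beta_a + 0.5 * beta_b)
```

With ordinary random draws, as in the simulation in Part 1, the assumptions hold only approximately, so the pooled slope lands close to, but not exactly at, the average.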
Question 7: Now suppose the break does not fall at the halfway point: \(\beta_1^A = 1.2\) and \(\beta_1^B = 2.5\), with 90% of observations coming from the normal period. Using the general weighted-average formula \(\hat{\beta}_1 = w \cdot \beta_1^A + (1-w) \cdot \beta_1^B\), what would the pooled estimate be?
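The weighted-average formula quoted in Question 7 follows from the same steps as the proof above. As a sketch, suppose regime A has \(n_A\) observations, regime B has \(n_B\) observations, \(N = n_A + n_B\), and both assumptions still hold. The symbols \(n_A\), \(n_B\), and the index sets \(A\) and \(B\) are notation introduced here for illustration:

\[\begin{align} \hat{\beta}_1 &= \frac{ \frac{1}{N}\sum_{i \in A}(x_i-\bar{x})(y_i-\bar{y}) }{ Var(X) } + \frac{ \frac{1}{N}\sum_{i \in B}(x_i-\bar{x})(y_i-\bar{y}) }{ Var(X) } \\ &= \frac{n_A}{N} \frac{ \frac{1}{n_A}\sum_{i \in A}(x_i-\bar{x}_A)(y_i-\bar{y}_A) }{ Var(X_A) } + \frac{n_B}{N} \frac{ \frac{1}{n_B}\sum_{i \in B}(x_i-\bar{x}_B)(y_i-\bar{y}_B) }{ Var(X_B) } \\ &= w \hat{\beta}_1^A + (1 - w) \hat{\beta}_1^B, \qquad w = \frac{n_A}{N}. \end{align}\]

So the weight on each regime is its share of the sample, and the halfway break is the special case \(w = \frac{1}{2}\).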
Part 3: Stock Market Investigation
CAPM says \(\beta\) measures how much a stock moves with the market. But this relationship may not be stable over time, particularly during crises, when correlations across assets tend to spike and volatility rises sharply. In this section, we’ll estimate Amazon’s and Walmart’s betas separately in 2-year windows and watch how each stock’s market exposure shifts over time.
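To make the idea of window-by-window betas concrete before turning to real data, here is a minimal sketch on simulated monthly returns. Everything in it is hypothetical: the `toy` data, the column names `excess_ret`, `true_beta`, and `window_start`, and the chosen beta values. It also deliberately uses a different technique from the `map()` approach in the questions below: it labels each month with its 2-year window and computes the slope as \(Cov / Var\) within each group.

```r
library(dplyr)
library(lubridate)

set.seed(1)

# Hypothetical monthly data: true beta is 1 in 2000-2001 and 2 in 2002-2003
toy <- tibble(
  date = seq.Date(ymd(20000101), ymd(20031201), by = "month"),
  mkt_rf = rnorm(48, mean = 0.005, sd = 0.04),
  true_beta = if_else(date < ymd(20020101), 1, 2),
  excess_ret = true_beta * mkt_rf + rnorm(48, mean = 0, sd = 0.01)
)

toy_betas <- toy %>%
  # Label each month with the first year of its 2-year window (2000, 2002, ...)
  mutate(window_start = 2000 + 2 * ((year(date) - 2000) %/% 2)) %>%
  group_by(window_start) %>%
  # Within each window, the OLS slope is Cov(excess_ret, mkt_rf) / Var(mkt_rf)
  summarise(
    beta = cov(excess_ret, mkt_rf) / var(mkt_rf),
    n_months = n()
  )

toy_betas
```

Each row of `toy_betas` is one window’s estimated beta. A single regression over all 48 months would instead return one number somewhere between the two regime values.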
# from previous chapters
stock_panel <- read_csv("stock_panel.csv")
library(frenchdata)
french <- download_french_data('Fama/French 3 Factors')$subsets$data[[1]] %>%
rename(mkt_rf = `Mkt-RF`) %>%
mutate(
date = ym(date),
across(c(mkt_rf, SMB, HML, RF), ~ .x / 100)
)
joined_data <- stock_panel %>%
mutate(date = floor_date(date, unit = "month")) %>%
left_join(french, join_by(date)) %>%
select(permno, comnam, date, ret, mkt_rf, RF)
# Question 8: Use map() to estimate Amazon's CAPM beta
# for each 2-year window in the data set. Amazon's
# permno is 84788.
amazon_betas <- map(
.x = seq.Date(from = ymd(20000101), to = ymd(20240101), by = "2 years"),
.f = function(t) {
joined_data %>%
filter(permno == ___, date >= ___, date < ___ %m+% years(2)) %>%
lm(ret - RF ~ mkt_rf, data = .) %>%
broom::tidy() %>%
slice(2) %>%
select(beta = estimate) %>%
mutate(period = t, stock = "Amazon")
}
) %>%
bind_rows()
# Question 9: Do the same for Walmart (permno 55976),
# then plot both together.
walmart_betas <- map(
.x = seq.Date(from = ymd(20000101), to = ymd(20240101), by = "2 years"),
.f = function(t) {
joined_data %>%
filter(permno == ____, date >= ____, date < ____ %m+% years(2)) %>%
lm(ret - RF ~ mkt_rf, data = .) %>%
broom::tidy() %>%
slice(2) %>%
select(beta = estimate) %>%
mutate(period = t, stock = "Walmart")
}
) %>%
bind_rows()
bind_rows(____, ____) %>%
ggplot(aes(x = period, y = beta, color = stock, group = stock)) +
geom_point() +
geom_line() +
geom_hline(yintercept = 1, linetype = "dashed", color = "gray50")
Question 10: Looking at the plot, which stock has a higher beta on average? In which windows does the gap between them narrow? What does this tell you about how structural breaks can affect aggressive vs. defensive stocks differently?