HW Guide Part B: Reconstructing the S&P 500 Index#
Introduction#
The S&P 500 Index is one of the most well-known benchmarks for the U.S. equity market. It represents the performance of 500 large-cap companies across various industries, serving as a proxy for the overall stock market. Constructing an index of this nature involves aggregation of market data, weighting by market capitalization, and accounting for changes in index membership over time. This lecture introduces the mechanics of the S&P 500 index, provides an overview of its construction, and guides you through a case study on approximating its level and returns using data from CRSP and WRDS. Some of the information needed to perfectly replicate the S&P 500 index is proprietary and not available through WRDS. So, the approximation in the next section will be imperfect. However, it will be close enough. For now, we will use this note to get a better understanding of how the S&P 500 is constructed.
Why Reconstruct the S&P 500 Index?#
Reconstructing the S&P 500 index from its underlying constituents is a valuable exercise for students of quantitative finance. This process:
Provides Practical Data Experience: You’ll gain hands-on experience with downloading and processing real financial data from Wharton Research Data Services (WRDS) and CRSP.
Develops Insight into Index Construction: By breaking down the methodology, you will better understand the structure and mechanics of market-capitalization-weighted indices.
Builds Programming Skills: The reconstruction process requires advanced data handling, including merging datasets, calculating weights, and implementing rebalancing rules.
Reveals Approximation Techniques: As the S&P 500’s official calculation includes proprietary adjustments, this exercise demonstrates how to approximate real-world phenomena using accessible data and mathematical techniques.
Enhances Quantitative Finance Skills: The reconstruction builds your ability to analyze returns, index performance, and the impact of portfolio rebalancing.
Learning Goals#
By the end of this lecture and exercise, you should be able to:
Explain the construction methodology of the S&P 500 index.
Use historical data on index constituents, prices, and shares outstanding to approximate the index.
Implement two methods for approximating the index:
Method A: a simple sum of the market caps of the index’s constituents
Method B: an approximation that assumes that a trader forms a porfolio holding the index’s constituents and rebalances the portfolio quarterly.
Compare your reconstructed index to the actual S&P 500 index by analyzing the correlation of levels and returns.
Official S&P 500 Index Methodology#
The S&P 500 index contains approximately 500 of the largest publicly traded U.S. companies by market capitalization, selected by the S&P Dow Jones Indices committee based on criteria such as liquidity, profitability, and market size. The index is a market-capitalization-weighted index, meaning that each stock in the index is weighted by its market cap. The index is rebalanced quarterly to reflect changes in membership and market values, ensuring it accurately represents the large-cap equity market. Key features of the index include:
A full description of the methodology used to calculate the S&P 500 index is available here. However, the following steps provide a simplified overview of the calculation process.
1. Determine the Market Capitalization of Each Constituent Company#
The market capitalization of a company represents its total equity value in the stock market. It is determined by multiplying the current market price per share by the total number of outstanding shares.
where \(P_{i,t}\) is the stock price and \(\text{Shrout}_{i,t}\) is the number of shares outstanding.
2. Adjust the Market Capitalization for Free Float#
The S&P 500 utilizes a “free-float market capitalization-weighted methodology”. This method focuses solely on the shares readily available for trading by the public, excluding those held by company insiders or governments. Therefore, the next crucial step involves adjusting the market capitalization to account for the “free float” of each company’s shares. Free float refers to the portion of outstanding shares that are readily available for trading in the open market. This adjustment ensures that the index accurately reflects the true market value of the companies based on the shares that investors can actively buy and sell.
To calculate the free-float market capitalization, an Investable Weight Factor (IWF) is employed. The IWF represents the percentage of a company’s total outstanding shares that are included in the index calculation. It is used to adjust for various factors, including foreign ownership restrictions, that might affect a stock’s actual weight in the index.
where \(\text{IWF}_{i}\) represents the proportion of a company’s shares that are freely available for trading.
The IWF is typically expressed as a percentage and is obtained from financial data providers. For example, if 80% of a company’s shares are available for public trading, its IWF would be 0.8.
3. Calculate the Total Market Capitalization of the Index#
Once the free-float market capitalization is calculated for each company, these values are summed to arrive at the total market capitalization of the S&P 500. This aggregate value serves as the numerator in the index calculation.
The total market cap of the index is
where \(\mathcal{I}_{S\&P, t}\) represents the set of constituents at time \(t\).
4. Calculate the Divisor#
The divisor is a critical element in the S&P 500 calculation. It is a proprietary value determined by S&P Dow Jones Indices and is used to scale the index down to a more manageable and reportable level. The divisor is subject to adjustments over time to account for various corporate actions that could influence the index’s value. These actions include:
Stock Splits: When a company splits its stock, the number of outstanding shares increases, while the price per share decreases proportionally. The divisor is adjusted to prevent this corporate action from artificially impacting the index level.
Dividends: While regular cash dividends do not directly affect the index level, special dividends, which are larger, one-time distributions, can influence the divisor.
Spinoffs: When a company spins off a subsidiary into a separate publicly traded entity, the divisor is adjusted to reflect the change in the index’s composition.
Index Reconstitution: The S&P 500 periodically adds or removes companies to ensure it continues to represent the leading U.S. equities. These changes also necessitate adjustments to the divisor.
The divisor plays a crucial role in maintaining the index’s continuity over time. The divisor adjustment process ensures that these actions do not introduce artificial jumps or drops in the index level. By adjusting for corporate actions and changes in the index’s constituents, the divisor ensures that the S&P 500 remains a consistent and reliable benchmark for tracking the performance of the U.S. equity market. The formula for the divisor adjustment depends on the specific corporate action and is detailed in S&P Dow Jones Indices’ methodology documents (see S&P Dow Jones Indices’ Index Mathematics Methodology).
Corporate Action |
Impact on Divisor |
Example |
|---|---|---|
Stock Split |
Adjusted to maintain index continuity |
A 2-for-1 stock split doubles shares but halves the price. |
Special Dividend |
Adjusted to account for the distribution |
A one-time dividend reduces market capitalization. |
Spinoff |
Adjusted to reflect the spun-off entity’s market cap |
A parent company spins off a subsidiary as a new company. |
Index Reconstitution |
Adjusted for additions/removals in the index |
A new company is added, requiring a reweighting of constituents. |
5. Calculate the Index Level#
Finally, the S&P 500 index level is calculated as:
where \(P_{i,t}\) is the stock price of company \(i\) at time \(t\), \(\text{Free-float Shrout}_{i,t}\) is the number of freely traded shares of company \(i\) at time \(t\), and \(\text{Divisor}\) is the proprietary scaling factor.
This formula aggregates the weighted market capitalizations of all constituents, scaled by the divisor to produce the reported index level.
Approximating the S&P 500 Index with data from CRSP#
As discussed in the introduction, the S&P 500 index is a proprietary calculation and not available through WRDS. However, we can approximate the index level and returns using data from CRSP. Here, we will implement two methods for approximating the index:
Method A: a simple sum of the market caps of the index’s constituents
Method B: an approximation that assumes that a trader forms a porfolio holding the index’s constituents and rebalances the portfolio quarterly. We will then compare the reconstructed index to the actual S&P 500 index by analyzing the correlation of levels and returns.
Data Required for Approximation#
To reconstruct the S&P 500 index, we need:
Constituent Data: Membership data indicating which stocks were part of the index at each point in time.
Price and Shares Outstanding Data: Stock-level data to calculate market capitalizations.
Index Level and Return Data: Official S&P 500 data for comparison and normalization.
Data on the historical constituents of the S&P 500 is available through WRDS, while price and shares outstanding data can be accessed through CRSP.
Let’s explore that data now.
import pandas as pd
import seaborn as sns
import wrds
from matplotlib import pyplot as plt
import calc_SP500_index
import pull_CRSP_stock
from settings import config
DATA_DIR = config("DATA_DIR")
WRDS_USERNAME = config("WRDS_USERNAME")
START_DATE = pd.to_datetime("1960-01-01")
END_DATE = pd.to_datetime("2023-12-29")
sns.set_theme()
Load the Constituent Data#
db = wrds.Connection(wrds_username=WRDS_USERNAME)
df_constituents = db.raw_sql("""
select *
from crsp_m_indexes.dsp500list_v2
-- limit 2000
""")
db.close()
# Convert string columns to datetime if they aren't already
df_constituents["mbrstartdt"] = pd.to_datetime(df_constituents["mbrstartdt"])
df_constituents["mbrenddt"] = pd.to_datetime(df_constituents["mbrenddt"])
df_constituents.head()
Loading library list...
Done
| permno | indno | mbrstartdt | mbrenddt | mbrflg | indfam | |
|---|---|---|---|---|---|---|
| 0 | 10006 | 1000500 | 1957-03-01 | 1984-07-18 | NORM | 1100500 |
| 1 | 10030 | 1000500 | 1957-03-01 | 1969-01-08 | NORM | 1100500 |
| 2 | 10049 | 1000500 | 1925-12-31 | 1932-10-01 | NORM | 1100500 |
| 3 | 10057 | 1000500 | 1957-03-01 | 1992-07-02 | NORM | 1100500 |
| 4 | 10078 | 1000500 | 1992-08-20 | 2010-01-28 | NORM | 1100500 |
df_constituents.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2080 entries, 0 to 2079
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 permno 2080 non-null Int64
1 indno 2080 non-null Int64
2 mbrstartdt 2080 non-null datetime64[ns]
3 mbrenddt 2080 non-null datetime64[ns]
4 mbrflg 2080 non-null string
5 indfam 2080 non-null Int64
dtypes: Int64(3), datetime64[ns](2), string(1)
memory usage: 103.7 KB
Dictionary of columns:
permno, int64: Permno of the constituent, where a permno is a unique identifier for a stock issued by CRSP.indno, int64: Index number of the constituent, where the index number is a unique identifier for the S&P 500 index.mbrstartdt, object: Membership start date of the constituent in the S&P 500 index (inclusive). This is the first date the stock IS a constituent.mbrenddt, object: Membership end date of the constituent in the S&P 500 index (exclusive). This is the first date the stock is NO LONGER a constituent. A stock is included in the index ifmbrstartdt <= date < mbrenddt.mbrflg, object: Membership flag of the constituent in the S&P 500 index.indfam, int64: Index family of the constituent, where the index family is a unique identifier for the S&P 500 index.
Note on date filtering: When filtering for constituents on a given date, use mbrstartdt <= date and mbrenddt > date (not >=). See Discussion #52 for details.
df_constituents.describe()
| permno | indno | mbrstartdt | mbrenddt | indfam | |
|---|---|---|---|---|---|
| count | 2080.0 | 2080.0 | 2080 | 2080 | 2080.0 |
| mean | 42265.837981 | 1000500.0 | 1982-03-07 09:48:27.692307712 | 2000-01-27 03:15:13.846153856 | 1100500.0 |
| min | 10006.0 | 1000500.0 | 1925-12-31 00:00:00 | 1928-08-24 00:00:00 | 1100500.0 |
| 25% | 18303.5 | 1000500.0 | 1957-03-01 00:00:00 | 1982-06-28 06:00:00 | 1100500.0 |
| 50% | 28349.0 | 1000500.0 | 1983-04-14 12:00:00 | 2002-05-12 00:00:00 | 1100500.0 |
| 75% | 68657.5 | 1000500.0 | 2003-03-23 12:00:00 | 2024-12-20 00:00:00 | 1100500.0 |
| max | 93436.0 | 1000500.0 | 2025-11-28 00:00:00 | 2025-11-28 00:00:00 | 1100500.0 |
| std | 27685.740222 | 0.0 | NaN | NaN | 0.0 |
df_constituents["permno"].nunique()
1952
df_constituents["mbrflg"].value_counts()
mbrflg
NORM 2080
Name: count, dtype: Int64
# There is only one index family for all the constituents.
df_constituents["indfam"].value_counts()
indfam
1100500 2080
Name: count, dtype: Int64
# There is only one index number for all the constituents.
df_constituents["indno"].value_counts()
indno
1000500 2080
Name: count, dtype: Int64
## Constituents at a given date
mydate = pd.to_datetime("2019-03-01")
# Note: mbrenddt is exclusive (first day stock is NOT a constituent)
# So we use ">" not ">=" for the end date comparison
mask = (df_constituents["mbrstartdt"] <= mydate) & (
df_constituents["mbrenddt"] > mydate
)
constituents = df_constituents[mask].sort_values(by=["mbrstartdt", "mbrenddt"])
constituents
| permno | indno | mbrstartdt | mbrenddt | mbrflg | indfam | |
|---|---|---|---|---|---|---|
| 353 | 15069 | 1000500 | 1925-12-31 | 2024-11-25 | NORM | 1100500 |
| 12 | 10145 | 1000500 | 1925-12-31 | 2025-11-28 | NORM | 1100500 |
| 102 | 11404 | 1000500 | 1925-12-31 | 2025-11-28 | NORM | 1100500 |
| 117 | 11674 | 1000500 | 1925-12-31 | 2025-11-28 | NORM | 1100500 |
| 130 | 11850 | 1000500 | 1925-12-31 | 2025-11-28 | NORM | 1100500 |
| ... | ... | ... | ... | ... | ... | ... |
| 2005 | 90520 | 1000500 | 2018-12-24 | 2025-03-21 | NORM | 1100500 |
| 170 | 12448 | 1000500 | 2019-01-02 | 2023-05-03 | NORM | 1100500 |
| 1231 | 44329 | 1000500 | 2019-01-18 | 2025-03-21 | NORM | 1100500 |
| 1079 | 32986 | 1000500 | 2019-02-15 | 2025-11-28 | NORM | 1100500 |
| 1780 | 81677 | 1000500 | 2019-02-27 | 2025-11-28 | NORM | 1100500 |
505 rows Ă— 6 columns
## Notice that the constituents change at the end of March.
mydate = pd.to_datetime("2019-04-01")
# Note: mbrenddt is exclusive (first day stock is NOT a constituent)
# So we use ">" not ">=" for the end date comparison
mask = (df_constituents["mbrstartdt"] <= mydate) & (
df_constituents["mbrenddt"] > mydate
)
constituents = df_constituents[mask].sort_values(by=["mbrstartdt", "mbrenddt"])
constituents
| permno | indno | mbrstartdt | mbrenddt | mbrflg | indfam | |
|---|---|---|---|---|---|---|
| 353 | 15069 | 1000500 | 1925-12-31 | 2024-11-25 | NORM | 1100500 |
| 12 | 10145 | 1000500 | 1925-12-31 | 2025-11-28 | NORM | 1100500 |
| 102 | 11404 | 1000500 | 1925-12-31 | 2025-11-28 | NORM | 1100500 |
| 117 | 11674 | 1000500 | 1925-12-31 | 2025-11-28 | NORM | 1100500 |
| 130 | 11850 | 1000500 | 1925-12-31 | 2025-11-28 | NORM | 1100500 |
| ... | ... | ... | ... | ... | ... | ... |
| 1231 | 44329 | 1000500 | 2019-01-18 | 2025-03-21 | NORM | 1100500 |
| 1079 | 32986 | 1000500 | 2019-02-15 | 2025-11-28 | NORM | 1100500 |
| 1780 | 81677 | 1000500 | 2019-02-27 | 2025-11-28 | NORM | 1100500 |
| 530 | 18420 | 1000500 | 2019-03-19 | 2025-11-28 | NORM | 1100500 |
| 531 | 18421 | 1000500 | 2019-03-19 | 2025-11-28 | NORM | 1100500 |
505 rows Ă— 6 columns
Load the CRSP Stock and Index Data#
df_msf is the CRSP monthly stock file. This file contains the price and shares outstanding data for all stocks in the CRSP database.
permno: Permno of the stock, which is a unique identifier for a stock issued by CRSP.date: Date of the observation.prc: Price of the stock.shrout: Shares outstanding of the stock.exchcd: Exchange code of the stock.shrcd: Share code of the stock.siccd: Standard Industrial Classification code of the stock.cfacshr: Cumulative adjustment factor for shares outstanding. This adjust, e.g., for stock splits.cfacpr: Cumulative adjustment factor for price. This adjust, e.g., for stock splits.ret: Return of the stock, including dividends.retx: Return of the stock, excluding dividends.
The cumulative adjustment factors are used to adjust for corporate actions that affect the stock price, such as stock splits. So
df_msf["adj_shrout"] = df_msf["shrout"] * df_msf["cfacshr"]
df_msf["adj_prc"] = df_msf["prc"].abs() / df_msf["cfacpr"]
Total market cap is then
df_msf["market_cap"] = df_msf["adj_prc"] * df_msf["adj_shrout"].
Note that the price is converted to an absolute value because CRSP reports negative prices when price data is missing and quote data is used in its place. Negative prices indicate that CRSP has used the midpoint of the bid-ask spread as the price. Since we don’t mind using quote data or price data, we can safely convert the price to an absolute value.
Also, as an additional note, CRSP reports that “cfacshr” and “cfacpr” are not always equal. This means that we cannot use market_cap = prc * shrout alone. We need to use the cumulative adjustment factors to adjust for corporate actions that affect the stock price, such as stock splits. “cfacshr” and “cfacpr” are not always equal because of less common distribution events, spinoffs, and rights. See here: CRSP - Useful Variables
df_msf = pull_CRSP_stock.load_CRSP_monthly_file(data_dir=DATA_DIR)
df_msix = pull_CRSP_stock.load_CRSP_index_files(data_dir=DATA_DIR)
df_msf.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4007330 entries, 0 to 7329
Data columns (total 25 columns):
# Column Dtype
--- ------ -----
0 permno Int64
1 permco Int64
2 date datetime64[ns]
3 ret Float64
4 retx Float64
5 shrout Int64
6 prc Float64
7 vol Float64
8 cfacshr Float64
9 cfacpr Float64
10 primaryexch string
11 siccd Int64
12 naics string
13 issuertype string
14 securitytype string
15 securitysubtype string
16 sharetype string
17 usincflg string
18 tradingstatusflg string
19 conditionaltype string
20 altprc Float64
21 adj_shrout Float64
22 adj_prc Float64
23 market_cap Float64
24 jdate datetime64[ns]
dtypes: Float64(10), Int64(4), datetime64[ns](2), string(9)
memory usage: 848.4 MB
df_msix is the CRSP monthly index file. This file contains the return and level of the S&P 500 index.
caldt: Date of the observation.sprtrn: Return of the S&P 500 index.spindx: Level of the S&P 500 index.
This dataframe also contains information about the CRSP value-weighted index and equal-weighted index, along with the 10 deciles of the CRSP market cap index: ((Monthly)NYSE/AMEX/NASDAQ Capitalization Deciles, Annual Rebalanced (msix)) https://wrds-www.wharton.upenn.edu/data-dictionary/crsp_a_indexes/msix/
df_msix.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 720 entries, 0 to 719
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 caldt 720 non-null datetime64[ns]
1 vwretd 720 non-null Float64
2 vwindd 720 non-null Float64
3 vwretx 720 non-null Float64
4 vwindx 720 non-null Float64
5 ewretd 720 non-null Float64
6 ewindd 720 non-null Float64
7 ewretx 720 non-null Float64
8 ewindx 720 non-null Float64
9 sprtrn 720 non-null Float64
10 spindx 720 non-null Float64
11 decret1 720 non-null Float64
12 decind1 720 non-null Float64
13 decret2 720 non-null Float64
14 decind2 720 non-null Float64
15 decret3 720 non-null Float64
16 decind3 720 non-null Float64
17 decret4 720 non-null Float64
18 decind4 720 non-null Float64
19 decret5 720 non-null Float64
20 decind5 720 non-null Float64
21 decret6 720 non-null Float64
22 decind6 720 non-null Float64
23 decret7 720 non-null Float64
24 decind7 720 non-null Float64
25 decret8 720 non-null Float64
26 decind8 720 non-null Float64
27 decret9 720 non-null Float64
28 decind9 720 non-null Float64
29 decret10 720 non-null Float64
30 decind10 720 non-null Float64
31 totval 720 non-null Float64
32 totcnt 720 non-null Int64
33 usdval 720 non-null Float64
34 usdcnt 720 non-null Int64
dtypes: Float64(32), Int64(2), datetime64[ns](1)
memory usage: 220.9 KB
df_msix[["caldt", "sprtrn", "spindx"]].tail()
| caldt | sprtrn | spindx | |
|---|---|---|---|
| 715 | 2024-08-30 | 0.022835 | 5648.4 |
| 716 | 2024-09-30 | 0.020197 | 5762.48 |
| 717 | 2024-10-31 | -0.009897 | 5705.45 |
| 718 | 2024-11-29 | 0.057301 | 6032.38 |
| 719 | 2024-12-31 | -0.02499 | 5881.63 |
Method A: Simple Sum of Market Caps#
Market Cap Calculation: For each stock in the index, compute the market capitalization as: $\( \text{MarketCap}_{i, t} = P_{i,t} \times \text{Shrout}_{i,t}. \)\( The total market cap of the index is: \)\( \text{MarketCapSP500}_{t} = \sum_{i \in \mathcal{I}_{S\&P, t}} \text{MarketCap}_{i, t}, \)\( where \)\mathcal{I}_{S&P, t}\( represents the set of constituents at time \)t$.
Index Level Approximation: Using the market cap ratio, calculate the approximated index level: $\( \text{IndexLevelApproxA}_{t} = \frac{\text{MarketCapSP500}_{t}}{\text{MarketCapSP500}_{t-1}} \times \text{IndexLevelApproxA}_{t-1}. \)\( Normalize \)\text{IndexLevelApproxA}$ at the start date to match the actual S&P 500 level.
Return Approximation: Calculate the returns as: $\( R_{A, t} = \frac{\text{IndexLevelApproxA}_{t}}{\text{IndexLevelApproxA}_{t-1}} - 1. \)$
sp500_total_market_cap = calc_SP500_index.calculate_sp500_total_market_cap(
df_constituents, df_msf
)
sp500_total_market_cap.tail()
| date | sp500_market_cap | n_constituents | n_constituents_avail | |
|---|---|---|---|---|
| 415 | 2024-08-30 | 4.932769e+13 | 503 | 503 |
| 416 | 2024-09-30 | 5.049335e+13 | 503 | 503 |
| 417 | 2024-10-31 | 4.999039e+13 | 503 | 503 |
| 418 | 2024-11-29 | 5.295661e+13 | 503 | 503 |
| 419 | 2024-12-31 | 5.181233e+13 | 503 | 503 |
sns.lineplot(data=sp500_total_market_cap, x="date", y="sp500_market_cap")
<Axes: xlabel='date', ylabel='sp500_market_cap'>
sp500_total_market_cap_and_returns = (
calc_SP500_index.append_actual_sp500_index_and_approx_returns_A(
sp500_total_market_cap, df_msix
)
)
sp500_total_market_cap_and_returns.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420 entries, 0 to 419
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 420 non-null datetime64[ns]
1 spindx 420 non-null Float64
2 sprtrn 420 non-null Float64
3 sp500_market_cap 420 non-null float64
4 n_constituents 420 non-null int64
5 n_constituents_avail 420 non-null int64
6 sp500_market_cap_norm 420 non-null float64
7 ret_approx_A 419 non-null float64
8 cumret_approx_A 419 non-null float64
9 sp500_cumret 420 non-null Float64
dtypes: Float64(3), datetime64[ns](1), float64(4), int64(2)
memory usage: 34.2 KB
# Plot cumulative returns
plt.figure(figsize=(12, 6))
sns.lineplot(
data=sp500_total_market_cap_and_returns,
x="date",
y="cumret_approx_A",
label="Cumulative Return (Approximation A)",
)
sns.lineplot(
data=sp500_total_market_cap_and_returns,
x="date",
y="sp500_cumret",
label="Cumulative Return (S&P 500)",
)
plt.title("S&P 500: Cumulative Returns Comparison")
plt.xlabel("Date")
plt.ylabel("Cumulative Return (1 = Initial Value)")
plt.grid(True)
plt.legend()
plt.show()
# Print correlation of returns
correlation = sp500_total_market_cap_and_returns["ret_approx_A"].corr(
sp500_total_market_cap_and_returns["sprtrn"]
)
print(f"Correlation between returns: {correlation:.4f}")
Correlation between returns: 0.9964
Notice that the difference in the cumulative returns is quite large by the end of the sample. However, the correlation between the returns is quite high (over 0.99). This just shows the consequences of compounding small differences over a long period of time.
Part of the reason why these series are so different is that the approximation method does not account for quarterly rebalancing.
Why Method A Does Not Represent a Self-Financing Portfolio Strategy
A self-financing trading strategy is one where no money is added or withdrawn after the initial investment—all rebalancing is funded by selling existing holdings. Mathematically, if \(w_{i,t}\) denotes the portfolio weight of stock \(i\) at time \(t\), a self-financing portfolio earns return: $\( R_{t+1} = \sum_i w_{i,t} \cdot R_{i,t+1} \)$ That is, the portfolio return equals the weighted average of individual stock returns, where weights are determined before observing returns.
Method A does not satisfy this property. It calculates returns as the percentage change in total market cap: $\( R_{A,t} = \frac{\sum_{i \in \mathcal{I}_t} \text{MarketCap}_{i,t}}{\sum_{j \in \mathcal{I}_{t-1}} \text{MarketCap}_{j,t-1}} - 1 \)$
When the index composition changes between \(t-1\) and \(t\), the numerator and denominator sum over different sets of stocks. For example, if stock B is removed and stock C is added, Method A effectively assumes you could buy C and sell B at their prior prices—but you cannot. Transactions must occur at current market prices. This creates a mismatch that does not correspond to any tradable portfolio.
Method B (described next) resolves this by computing portfolio returns using weights that are known at the start of each period, which is the defining property of a self-financing strategy.
Method B: Portfolio Rebalancing#
This method assumes that a trader forms a porfolio holding the index’s constituents and rebalances the portfolio quarterly. Let’s assume that the total amount of money in the portfolio at the beginning of the period is 1 (e.g. $1 million). Then, assume that at each rebalancing date (in this case, quarterly), the trader buys and sells stocks such that the portfolio weights are equal to the weights of the index’s constituents with respect to the ratio of the market cap of each stock to the total market cap of the index.
Weight Calculation: For each stock \(i\), calculate its weight in the index: $\( w_{i, t} = \begin{cases} \frac{\text{MarketCap}_{i, t}}{\text{MarketCapSP500}_{t}} & \text{if } i \in \mathcal{I}_{S\&P, t} \\ 0 & \text{otherwise} \end{cases} \)$
Portfolio Rebalancing: Implement quarterly rebalancing. The portfolio weights \(w^p_{i, t}\) are given by $\( w^p_{i, t} = \begin{cases} w_{i, t} & \text{if } t \text{ is a rebalancing date}, \\ w^p_{i, t-1} & \text{otherwise}, \end{cases} \)\( for all \)i\( and \)t$.
Return Approximation: Calculate the portfolio return as: $\( R_{B, t+1} = \sum_{i} w^p_{i, t} \times R_{i, t+1}, \)\( where \)R_{i, t+1}\( is the return of stock \)i\( between \)t\( and \)t+1$.
Index Level Approximation: Use the approximated returns to calculate the index level, normalized to match the official S&P 500 at the start date.
IMPORTANT: To make this method work as accurately as possible, use the retx variable for \(R_{i, t+1}\). This is because the ret variable includes dividends, which are not part of the S&P 500 index.
sp500_returns_B = calc_SP500_index.calculate_sp500_returns_with_rebalancing(
df_constituents, df_msf, start_date=START_DATE, end_date=END_DATE
)
df_msix["date"] = pd.to_datetime(df_msix["caldt"])
sp500_returns_B = pd.merge(
df_msix[["date", "spindx", "sprtrn"]],
sp500_returns_B,
on="date",
how="inner",
)
sp500_returns_B["diff"] = sp500_returns_B["ret_approx_B"] - sp500_returns_B["sprtrn"]
sns.lineplot(data=sp500_returns_B, x="date", y="diff", label="Rebalanced Portfolio")
sp500_returns_B.describe()
| date | spindx | sprtrn | ret_approx_B | diff | |
|---|---|---|---|---|---|
| count | 708 | 708.0 | 708.0 | 708.0 | 708.0 |
| mean | 1994-07-15 12:48:48.813559296 | 973.548291 | 0.006671 | 0.006635 | -0.000035 |
| min | 1965-01-29 00:00:00 | 63.54 | -0.21763 | -0.217689 | -0.008328 |
| 25% | 1979-10-22 18:00:00 | 110.955 | -0.018313 | -0.018645 | -0.0006 |
| 50% | 1994-07-14 12:00:00 | 465.005 | 0.009341 | 0.009494 | -0.000012 |
| 75% | 2009-04-07 12:00:00 | 1338.1875 | 0.03532 | 0.034908 | 0.00056 |
| max | 2023-12-29 00:00:00 | 4769.83 | 0.163047 | 0.164457 | 0.01106 |
| std | NaN | 1098.225895 | 0.043688 | 0.043805 | 0.001569 |
# Print correlation
correlation = sp500_returns_B["ret_approx_B"].corr(sp500_returns_B["sprtrn"])
print(f"Correlation between reconstructed and actual returns: {correlation:.4f}")
Correlation between reconstructed and actual returns: 0.9994
# Calculate cumulative returns
sp500_returns_B["cumret_approx_B"] = (1 + sp500_returns_B["ret_approx_B"]).cumprod()
sp500_returns_B["cumret_actual"] = (1 + sp500_returns_B["sprtrn"]).cumprod()
# Plot cumulative returns
plt.figure(figsize=(12, 6))
sns.lineplot(
data=sp500_returns_B,
x="date",
y="cumret_approx_B",
label="Cumulative Return (Rebalanced Portfolio)",
)
sns.lineplot(
data=sp500_returns_B,
x="date",
y="cumret_actual",
label="Cumulative Return (S&P 500)",
)
plt.title("S&P 500: Cumulative Returns with Quarterly Rebalancing")
plt.xlabel("Date")
plt.ylabel("Cumulative Return (1 = Initial Value)")
plt.grid(True)
plt.legend()
plt.show()
Above we can now see that the cumulative returns are very similar. The correlation between the returns is even higher (over 0.998).
Comparing the Approximations#
Once both methods are implemented, compare the reconstructed index to the actual S&P 500 by:
Correlation of Returns: Measure the correlation between the approximated returns and the official S&P 500 returns.
Index Level Alignment: Plot the approximated index levels against the official levels to assess how well each method tracks the S&P 500 over time.
Impact of Rebalancing: Analyze how quarterly rebalancing affects the accuracy of the reconstruction.
## Function that merges all results together
df = calc_SP500_index.create_sp500_index_approximations(data_dir=DATA_DIR)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420 entries, 0 to 419
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 420 non-null datetime64[ns]
1 spindx 420 non-null Float64
2 sprtrn 420 non-null Float64
3 sp500_market_cap 420 non-null float64
4 n_constituents 420 non-null int64
5 n_constituents_avail 420 non-null int64
6 sp500_market_cap_norm 420 non-null float64
7 ret_approx_A 419 non-null float64
8 cumret_approx_A 419 non-null float64
9 sp500_cumret 420 non-null Float64
10 ret_approx_B 419 non-null Float64
dtypes: Float64(4), datetime64[ns](1), float64(4), int64(2)
memory usage: 37.9 KB
df[["sprtrn", "ret_approx_A", "ret_approx_B"]].corr()
| sprtrn | ret_approx_A | ret_approx_B | |
|---|---|---|---|
| sprtrn | 1.000000 | 0.996360 | 0.999188 |
| ret_approx_A | 0.996360 | 1.000000 | 0.995615 |
| ret_approx_B | 0.999188 | 0.995615 | 1.000000 |
Above we saw that the rebalanced portfolio method is more accurate than the simple sum of market caps method. The cumulative returns are much more similar and the correlation between the returns is higher.