HW Guide Part A: CRSP Market Returns Indices#
The CRSP (Center for Research in Security Prices) dataset provides two indices
for market returns: an equal-weighted index and a value-weighted index (both provided
in terms of returns with and without dividends). The equal-weighted index
computes the simple average of returns across stocks. This series is available as EWRETD
and EWRETX
, (with and without dividends, respectively).
The value-Weighted Returns index represents a stock market index that calculates the return on investment by considering both the price changes and dividends of each component security, weighted by its market capitalization. This means that larger companies have a greater impact on the index’s performance compared to smaller companies. The value-weighting approach aims to reflect the actual investment returns that an investor would achieve by holding a market portfolio, mirroring the performance of the overall market or specific market segments more accurately than equal-weighted indices. The CRSP indices are widely used in academic research and financial analysis to study market trends, evaluate investment strategies, and benchmark the performance of portfolios against the broader market. This series is available in the CRSP tables under the mnemonic VWRETD
and VWRETX
(with and without dividends, respectively).
In this guide, we’ll discuss the construction of the equal- and value-weighted market return indices. To construct these indices, we’ll follow the suggestions here: https://wrds-www.wharton.upenn.edu/pages/support/support-articles/crsp/index-and-deciles/constructing-value-weighted-return-series-matches-vwretd-crsp-monthly-value-weighted-returns-includes-distributions/
These suggestions boil down to the most important part: we must select the correct universe of stocks that comprise “the market”.
import pandas as pd
from settings import config
import pull_CRSP_stock
import calc_CRSP_indices
import misc_tools
DATA_DIR = config("DATA_DIR")
df_msf = pull_CRSP_stock.load_CRSP_monthly_file(data_dir=DATA_DIR)
df_msix = pull_CRSP_stock.load_CRSP_index_files(data_dir=DATA_DIR)
df_msix.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 696 entries, 0 to 695
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 caldt 696 non-null datetime64[ns]
1 vwretd 696 non-null float64
2 vwindd 696 non-null float64
3 vwretx 696 non-null float64
4 vwindx 696 non-null float64
5 ewretd 696 non-null float64
6 ewindd 696 non-null float64
7 ewretx 696 non-null float64
8 ewindx 696 non-null float64
9 sprtrn 696 non-null float64
10 spindx 696 non-null float64
11 decret1 696 non-null float64
12 decind1 696 non-null float64
13 decret2 696 non-null float64
14 decind2 696 non-null float64
15 decret3 696 non-null float64
16 decind3 696 non-null float64
17 decret4 696 non-null float64
18 decind4 696 non-null float64
19 decret5 696 non-null float64
20 decind5 696 non-null float64
21 decret6 696 non-null float64
22 decind6 696 non-null float64
23 decret7 696 non-null float64
24 decind7 696 non-null float64
25 decret8 696 non-null float64
26 decind8 696 non-null float64
27 decret9 696 non-null float64
28 decind9 696 non-null float64
29 decret10 696 non-null float64
30 decind10 696 non-null float64
31 totval 696 non-null float64
32 totcnt 696 non-null int64
33 usdval 696 non-null float64
34 usdcnt 696 non-null int64
dtypes: datetime64[ns](1), float64(32), int64(2)
memory usage: 190.4 KB
df_msix[
[
"caldt",
"vwretd",
"vwretx",
"vwindx",
"ewretd",
"ewretx",
]
].tail()
caldt | vwretd | vwretx | vwindx | ewretd | ewretx | |
---|---|---|---|---|---|---|
691 | 2022-08-31 | -0.036383 | -0.038193 | 3109.090 | -0.007503 | -0.009162 |
692 | 2022-09-30 | -0.092081 | -0.093549 | 2818.238 | -0.106628 | -0.108430 |
693 | 2022-10-31 | 0.078669 | 0.077464 | 3036.551 | 0.051363 | 0.050314 |
694 | 2022-11-30 | 0.051431 | 0.049403 | 3186.567 | 0.019866 | 0.018111 |
695 | 2022-12-30 | -0.058719 | -0.060326 | 2994.333 | -0.050828 | -0.053276 |
Inclusion into the CRSP Market Index:#
Our experiments with different VWRETD replication methods show that it is relatively easy to come close to this data series using PERMNO-based returns in the CRSP datasets, but exact matches to every data month is not possible because we do not know the exact sample set of PERMNOs used by CRSP. Their criteria is listed in the CRSP manual and is roughly:
CRSP CAP-BASED PORTFOLIOS – The following types of securities, listed on NYSE, AMEX, and Nasdaq National Market, are eligible for inclusion in the Cap-Based Indices:
Common Stocks
Certificates
Shares of Beneficial Interest
Units (Depository Units, Units of Beneficial Interest, Units of Limited Partnership Interest, Depository Receipts, etc.)
The following types of securities are NOT eligible for inclusion in the Cap-Based Indices:
ADRs
Closed-End Mutual Funds, WEBS Index Funds, Unit Investment Trusts
All Common Stocks with non-US Incorporation
Americus Trust Components
HOLDRs Trusts
REITs (Real Estate Investment Trusts)
Rights and Warrants
Preferred stock
“Packaged” Units (Common Stocks Bundled with Rights or Warrants)
Over-the-Counter Bulletin Board Issues
N.B. The Cap-Based Indices do include returns from time ranges during which eligible securities trade on “leading prices” or “reorganization” when-issued status. The Cap-Based Indices do NOT include returns from time ranges during which eligible securities trade on “ex-distribution” or “additional” when-issued status.
Note that VWRETD is not computed by WRDS but provided directly by CRSP along with the PERMNO based returns. For general SAS coding help for this problem see the WRDS Research Application: Portfolios by Size and Book-to-Market. This WRDS Support document provides examples of cap-based decile breakdowns, but the same general principles apply to the total market index.
I’ve provided code for you that will take care of this subsetting in the function pull_CRSP_monthly_file
:
SELECT
date,
msf.permno, msf.permco, shrcd, exchcd, comnam, shrcls,
ret, retx, dlret, dlretx, dlstcd,
prc, altprc, vol, shrout, cfacshr, cfacpr,
naics, siccd
FROM crsp.msf AS msf
LEFT JOIN
crsp.msenames as msenames
ON
msf.permno = msenames.permno AND
msenames.namedt <= msf.date AND
msf.date <= msenames.nameendt
LEFT JOIN
crsp.msedelist as msedelist
ON
msf.permno = msedelist.permno AND
date_trunc('month', msf.date)::date =
date_trunc('month', msedelist.dlstdt)::date
WHERE
msf.date BETWEEN '{start_date}' AND '{end_date}' AND
msenames.shrcd IN (10, 11, 20, 21, 40, 41, 70, 71, 73)
To best understand this, please look up shrcd
in the Data Manual here: https://wrds-www.wharton.upenn.edu/documents/396/CRSP_US_Stock_Indices_Data_Descriptions.pdf . You’ll find the information on p. 81.
Calculation of Equal-Weighted Returns and Value-Weighted Returns#
With the proper universe of stocks in hand, all that is left is to group the returns by permno
(the identifier of choice here) and average. However, the equal weighted average is a mere simple average. To calculate the value-weighted average, we need to calculate the lagged market cap of each stock \(i\) at time \(t\).
That is, the value-weighted return is given by the following formula:
where \(w_{i,t-1}\) is the market capitalization of stock \(i\) at time \(t-1\) and
\(r_t\) can be the returns with dividends ret
or the returns without dividends retx
.
The market capitalization of a stock is its price times the shares outstanding,
$\(
w_{it} = \text{SHROUT}_{it} \times \text{PRC}_{it}.
\)$
df_eq_idx = calc_CRSP_indices.calc_equal_weighted_index(df_msf)
df_vw_idx = calc_CRSP_indices.calc_CRSP_value_weighted_index(df_msf)
df_idxs = calc_CRSP_indices.calc_CRSP_indices_merge(df_msf, df_msix)
df_idxs[
[
"vwretd",
"vwretx",
"ewretd",
"ewretx",
"vwretd_manual",
"vwretx_manual",
"ewretd_manual",
"ewretx_manual",
]
].head()
vwretd | vwretx | ewretd | ewretx | vwretd_manual | vwretx_manual | ewretd_manual | ewretx_manual | |
---|---|---|---|---|---|---|---|---|
date | ||||||||
1965-03-31 | -0.009715 | -0.011428 | 0.015393 | 0.012949 | -0.010123 | -0.011815 | 0.015279 | 0.012819 |
1965-04-30 | 0.033652 | 0.032737 | 0.041560 | 0.040315 | 0.034408 | 0.033484 | 0.042506 | 0.041257 |
1965-06-30 | -0.051868 | -0.053592 | -0.083718 | -0.086084 | -0.051150 | -0.052780 | -0.084370 | -0.086684 |
1965-08-31 | 0.030883 | 0.026133 | 0.041844 | 0.039063 | 0.030704 | 0.025896 | 0.041610 | 0.038854 |
1965-09-30 | 0.031952 | 0.030357 | 0.027420 | 0.025166 | 0.032003 | 0.030430 | 0.026972 | 0.024740 |
df_idxs[
[
"vwretd",
"vwretx",
"ewretd",
"ewretx",
"vwretd_manual",
"vwretx_manual",
"ewretd_manual",
"ewretx_manual",
]
].corr()
vwretd | vwretx | ewretd | ewretx | vwretd_manual | vwretx_manual | ewretd_manual | ewretx_manual | |
---|---|---|---|---|---|---|---|---|
vwretd | 1.000000 | 0.999475 | 0.860171 | 0.860201 | 0.997999 | 0.997375 | 0.857553 | 0.857803 |
vwretx | 0.999475 | 1.000000 | 0.860499 | 0.860767 | 0.997391 | 0.997980 | 0.857868 | 0.858413 |
ewretd | 0.860171 | 0.860499 | 1.000000 | 0.999933 | 0.842205 | 0.842409 | 0.997593 | 0.997505 |
ewretx | 0.860201 | 0.860767 | 0.999933 | 1.000000 | 0.842226 | 0.842695 | 0.997539 | 0.997595 |
vwretd_manual | 0.997999 | 0.997391 | 0.842205 | 0.842226 | 1.000000 | 0.999292 | 0.840877 | 0.841120 |
vwretx_manual | 0.997375 | 0.997980 | 0.842409 | 0.842695 | 0.999292 | 1.000000 | 0.841071 | 0.841651 |
ewretd_manual | 0.857553 | 0.857868 | 0.997593 | 0.997539 | 0.840877 | 0.841071 | 1.000000 | 0.999911 |
ewretx_manual | 0.857803 | 0.858413 | 0.997505 | 0.997595 | 0.841120 | 0.841651 | 0.999911 | 1.000000 |
As you can see above, our manually-created return index doesn’t match the CRSP index perfectly but is still very close. In this HW, you’ll be required to construct this index only approximately. A loose match, as seen here, will be fine.
Note, a helpful tool to create the lagged time series for market capitalization is provided in misc_tools
.
Use the function with_lagged_column
, which will create a lagged column that accounts for the fact that multiple stocks show up in a flat file. See the following example:
a = [
[1, "1990/1/1", 1],
[1, "1990/2/1", 2],
[1, "1990/3/1", 3],
[2, "1989/12/1", 3],
[2, "1990/1/1", 3],
[2, "1990/2/1", 4],
[2, "1990/3/1", 5.5],
[2, "1990/4/1", 5],
[2, "1990/6/1", 6],
]
data = pd.DataFrame(a, columns=["id", "date", "value"])
data["date"] = pd.to_datetime(data["date"])
data
id | date | value | |
---|---|---|---|
0 | 1 | 1990-01-01 | 1.0 |
1 | 1 | 1990-02-01 | 2.0 |
2 | 1 | 1990-03-01 | 3.0 |
3 | 2 | 1989-12-01 | 3.0 |
4 | 2 | 1990-01-01 | 3.0 |
5 | 2 | 1990-02-01 | 4.0 |
6 | 2 | 1990-03-01 | 5.5 |
7 | 2 | 1990-04-01 | 5.0 |
8 | 2 | 1990-06-01 | 6.0 |
data_lag = misc_tools.with_lagged_columns(
df=data, column_to_lag="value", id_column="id", lags=1, freq="MS"
)
data_lag
id | date | value | L1_value | |
---|---|---|---|---|
2 | 1 | 1990-01-01 | 1.0 | NaN |
4 | 1 | 1990-02-01 | 2.0 | 1.0 |
6 | 1 | 1990-03-01 | 3.0 | 2.0 |
8 | 1 | 1990-04-01 | NaN | 3.0 |
1 | 2 | 1989-12-01 | 3.0 | NaN |
3 | 2 | 1990-01-01 | 3.0 | 3.0 |
5 | 2 | 1990-02-01 | 4.0 | 3.0 |
7 | 2 | 1990-03-01 | 5.5 | 4.0 |
9 | 2 | 1990-04-01 | 5.0 | 5.5 |
11 | 2 | 1990-05-01 | NaN | 5.0 |
13 | 2 | 1990-06-01 | 6.0 | NaN |
As you can see, naively using shift
to create our lag would miss the fact that observation 1989-12-01
for stock id=2
should have a missing lagged value
. For example, the following would be incorrect:
data["value"].shift(1)
0 NaN
1 1.0
2 2.0
3 3.0
4 3.0
5 3.0
6 4.0
7 5.5
8 5.0
Name: value, dtype: float64