Data Sources Overview#

Quick look at each raw dataset pulled by the pipeline. Fourteen datasets are currently collected:

  1. Fama-French SIC Codes β€” industry classification ranges (Dartmouth/Kenneth French)

  2. FISD Amount Outstanding History β€” historical par amount changes (WRDS)

  3. FISD Issue β€” bond-level reference data (WRDS)

  4. FISD Issuer β€” issuer-level reference data (WRDS)

  5. FISD Ratings (Moody’s) β€” Moody’s bond ratings history (WRDS)

  6. FISD Ratings (S&P) β€” S&P bond ratings history (WRDS)

  7. FISD Redemption β€” callable/redemption flags (WRDS)

  8. Liu-Wu Treasury Yields β€” zero-coupon yield curve (Google Sheets)

  9. OSBAP Corporate Bond Returns β€” monthly bond returns; used for Stage 1 validation only (OpenBondAssetPricing.com)

  10. OSBAP Linker (Fang, 2025) β€” bond-firm link mapping CUSIPs to PERMNO/PERMCO/GVKEY; core Stage 1 input (OpenBondAssetPricing.com)

  11. OSBAP Treasury Bond Returns β€” treasury returns; currently documentation only (OpenBondAssetPricing.com)

  12. TRACE 144A β€” Rule 144A private placement trades (WRDS)

  13. TRACE Enhanced β€” trade-level transaction data (WRDS)

  14. TRACE Standard β€” trade-level transaction data (WRDS)

from pathlib import Path

import polars as pl

from settings import config

DATA_DIR = Path(config("DATA_DIR"))
PULL_DIR = DATA_DIR / "pulled"

1. Fama-French SIC Codes β€” pulled/ff17_sic_ranges.parquet + pulled/ff30_sic_ranges.parquet#

Source script: pull_fama_french_sic.py | Source: Kenneth French Data Library (Dartmouth)

Industry-to-SIC-code range mappings from the Fama-French 17- and 30-industry classification systems. Each row maps an industry number and name to a contiguous SIC code range. Used by Stage 1 for industry classification.

Key columns:

Column

Description

ind_num

Industry number in the FF classification

ind_name

Short industry name

sic_start

Start of SIC code range

sic_end

End of SIC code range

ff17 = pl.scan_parquet(PULL_DIR / "ff17_sic_ranges.parquet")
n_rows_ff17 = ff17.select(pl.len()).collect().item()
cols_ff17 = ff17.collect_schema().names()
print(f"FF17 β€” Rows: {n_rows_ff17:,}  |  Columns: {len(cols_ff17)}")
print(f"Column names: {cols_ff17}")
FF17 β€” Rows: 567  |  Columns: 4
Column names: ['ind_num', 'ind_name', 'sic_start', 'sic_end']

Example rows β€” FF17#

ff17.head(5).collect()
shape: (5, 4)
ind_numind_namesic_startsic_end
i64stri64i64
1"Food"100199
1"Food"200299
1"Food"700799
1"Food"900999
1"Food"20002009
ff30 = pl.scan_parquet(PULL_DIR / "ff30_sic_ranges.parquet")
n_rows_ff30 = ff30.select(pl.len()).collect().item()
cols_ff30 = ff30.collect_schema().names()
print(f"FF30 β€” Rows: {n_rows_ff30:,}  |  Columns: {len(cols_ff30)}")
print(f"Column names: {cols_ff30}")
FF30 β€” Rows: 587  |  Columns: 4
Column names: ['ind_num', 'ind_name', 'sic_start', 'sic_end']

Example rows β€” FF30#

ff30.head(5).collect()
shape: (5, 4)
ind_numind_namesic_startsic_end
i64stri64i64
1"Food"100199
1"Food"200299
1"Food"700799
1"Food"910919
1"Food"20002009

2. FISD Amount Outstanding History β€” pulled/fisd_amt_out_hist.parquet#

Source script: pull_fisd.py | WRDS table: fisd.fisd_amt_out_hist

Historical changes in par amount outstanding for each bond issue. Records actions such as increases, decreases, and maturities that change the total amount of a bond outstanding over time.

amt_out = pl.scan_parquet(PULL_DIR / "fisd_amt_out_hist.parquet")
n_rows_amt = amt_out.select(pl.len()).collect().item()
cols_amt = amt_out.collect_schema().names()
print(f"Rows: {n_rows_amt:,}  |  Columns: {len(cols_amt)}")
print(f"Column names: {cols_amt}")
Rows: 862,582  |  Columns: 7
Column names: ['transaction_id', 'issue_id', 'action_type', 'effective_date', 'action_price', 'action_amount', 'amount_outstanding']

Example rows#

amt_out.head(5).collect()
shape: (5, 7)
transaction_idissue_idaction_typeeffective_dateaction_priceaction_amountamount_outstanding
f64f64strstrf64f64f64
1.01.0"IL""1994-05-31"nullnull65000.0
58254.01.0"REV""1995-05-31"nullnull65000.0
83800.01.0"REV""1996-05-31"nullnull65000.0
129007.01.0"REV""1997-05-31"nullnull65000.0
168464.01.0"REV""1998-05-31"nullnull65000.0

3. FISD Issue Data β€” pulled/fisd_issue.parquet#

Source script: pull_fisd.py | WRDS table: fisd.fisd_mergedissue

One row per fixed-income security (bond issue) in the Mergent FISD database. Contains bond characteristics such as coupon, maturity, and offering details.

Key columns:

Column

Description

complete_cusip

9-character CUSIP identifier

issue_name

Descriptive name of the bond issue

issuer_id

Foreign key to the issuer table

coupon

Annual coupon rate (%)

coupon_type

Coupon type (F=fixed, V=variable, etc.)

maturity

Maturity date

offering_date

Original offering date

bond_type

Bond type code (CDEB, CMTN, etc.)

offering_amt

Offering amount ($ millions)

principal_amt

Par value per bond (typically 1000)

fisd_issue = pl.scan_parquet(PULL_DIR / "fisd_issue.parquet")
n_rows = fisd_issue.select(pl.len()).collect().item()
cols = fisd_issue.collect_schema().names()
print(f"Rows: {n_rows:,}  |  Columns: {len(cols)}")
print(f"Column names: {cols}")
Rows: 710,098  |  Columns: 19
Column names: ['complete_cusip', 'issue_id', 'issue_name', 'issuer_id', 'foreign_currency', 'coupon_type', 'coupon', 'convertible', 'asset_backed', 'rule_144a', 'bond_type', 'private_placement', 'interest_frequency', 'dated_date', 'day_count_basis', 'offering_date', 'maturity', 'principal_amt', 'offering_amt']

Example rows#

fisd_issue.head(5).collect()
shape: (5, 19)
complete_cusipissue_idissue_nameissuer_idforeign_currencycoupon_typecouponconvertibleasset_backedrule_144abond_typeprivate_placementinterest_frequencydated_dateday_count_basisoffering_datematurityprincipal_amtoffering_amt
strf64strf64strstrf64strstrstrstrstrstrstrstrstrstrf64f64
"000361AA3"1.0"NT"3.0"N""F"9.5"N""N""N""CDEB""N""2""1989-11-01""30/360""1989-10-24""2001-11-01"1000.065000.0
"000361AB1"2.0"NT"3.0"N""F"7.25"N""N""N""CDEB""N""2""1993-10-15""30/360""1993-10-12""2003-10-15"1000.050000.0
"00077DAB5"3.0"MTN"40263.0"N""F"4.15"N""N""N""CMTN""N""2""1994-01-14""30/360""1994-01-07""1996-01-12"1000.0100000.0
"00077DAF6"4.0"SUB DEP NT SER B"40263.0"N""F"8.25"N""N""N""USBN""N""2""1994-08-02""30/360""1994-07-27""2009-08-01"1000.0200000.0
"00077TAA2"5.0"SUB DEP NT SER B"40263.0"N""F"7.75"N""N""N""CDEB""N""2""1993-05-15""30/360""1993-05-20""2023-05-15"1000.0250000.0

Date range#

fisd_issue.select(
    pl.col("offering_date").min().alias("earliest_offering"),
    pl.col("offering_date").max().alias("latest_offering"),
    pl.col("maturity").min().alias("earliest_maturity"),
    pl.col("maturity").max().alias("latest_maturity"),
).collect()
shape: (1, 4)
earliest_offeringlatest_offeringearliest_maturitylatest_maturity
strstrstrstr
"1894-10-02""2030-06-29""1990-01-01""2122-04-15"

Top 15 bond types#

(
    fisd_issue.group_by("bond_type")
    .agg(pl.len().alias("n"))
    .sort("n", descending=True)
    .head(15)
    .collect()
)
shape: (15, 2)
bond_typen
stru32
"CMTN"214264
"ADEB"163272
"CMTZ"135731
"CDEB"64624
"AMTN"41488
"ARNT"5865
"ASPZ"3935
"CCUR"2891
"EMTN"2450
"USBL"2071

Bond types β€” bar chart#

import matplotlib.pyplot as plt

bt = (
    fisd_issue.group_by("bond_type")
    .agg(pl.len().alias("n"))
    .sort("n", descending=True)
    .head(15)
    .collect()
)

fig, ax = plt.subplots(figsize=(10, 3))
ax.bar(bt["bond_type"].to_list(), bt["n"].to_list(), color="steelblue", alpha=0.8)
ax.set_ylabel("Count")
ax.set_title("FISD β€” top 15 bond types")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
plt.show()
../_images/ea39d54570bb1dcc0bc942e8ebeb323524c0eb0b4f2414db2be45603f71ff557.png

4. FISD Issuer Data β€” pulled/fisd_issuer.parquet#

Source script: pull_fisd.py | WRDS table: fisd.fisd_mergedissuer

One row per bond issuer. Provides the country of domicile and SIC industry classification for each issuer. Linked to the issue table via issuer_id.

Key columns:

Column

Description

issuer_id

Unique issuer identifier (joins to issue table)

country_domicile

Country of domicile (ISO code)

sic_code

Standard Industrial Classification code

fisd_issuer = pl.scan_parquet(PULL_DIR / "fisd_issuer.parquet")
n_rows_issuer = fisd_issuer.select(pl.len()).collect().item()
cols_issuer = fisd_issuer.collect_schema().names()
print(f"Rows: {n_rows_issuer:,}  |  Columns: {len(cols_issuer)}")
print(f"Column names: {cols_issuer}")
Rows: 17,200  |  Columns: 3
Column names: ['issuer_id', 'country_domicile', 'sic_code']

Example rows#

fisd_issuer.head(5).collect()
shape: (5, 3)
issuer_idcountry_domicilesic_code
f64strstr
3.0"USA""3720"
4.0"NLD""6029"
6.0"USA"null
7.0"USA""7382"
8.0"USA"null

Top 10 countries of domicile#

(
    fisd_issuer.group_by("country_domicile")
    .agg(pl.len().alias("n"))
    .sort("n", descending=True)
    .head(10)
    .collect()
)
shape: (10, 2)
country_domicilen
stru32
"USA"11895
"CAN"724
null594
"GBR"456
"NLD"284
"CYM"230
"BRA"185
"FRA"163
"DEU"154
"AUS"145

5. FISD Ratings (Moody’s) β€” pulled/fisd_ratings_moodys.parquet#

Source script: pull_fisd.py | WRDS table: fisd.fisd_ratings (filtered to rating_type = 'MR')

Historical Moody’s credit ratings for each bond issue. One row per rating event (initial assignment, upgrade, downgrade).

Key columns:

Column

Description

issue_id

FISD issue identifier

rating_date

Date of the rating action

rating

Moody’s rating string (e.g. Aaa, Baa1)

ratings_mdy = pl.scan_parquet(PULL_DIR / "fisd_ratings_moodys.parquet")
n_rows_mdy = ratings_mdy.select(pl.len()).collect().item()
cols_mdy = ratings_mdy.collect_schema().names()
print(f"Rows: {n_rows_mdy:,}  |  Columns: {len(cols_mdy)}")
print(f"Column names: {cols_mdy}")
Rows: 1,482,164  |  Columns: 3
Column names: ['issue_id', 'rating_date', 'rating']

Example rows#

ratings_mdy.head(5).collect()
shape: (5, 3)
issue_idrating_daterating
f64strstr
1.0"1993-05-26""Baa3"
1.0"1997-01-23""Baa2"
1.0"1999-12-29""Baa2"
1.0"2000-12-15""Baa2"
1.0"2001-04-06""Baa3"

Date range#

ratings_mdy.select(
    pl.col("rating_date").min().alias("earliest"),
    pl.col("rating_date").max().alias("latest"),
).collect()
shape: (1, 2)
earliestlatest
strstr
"1895-10-01""2090-10-04"

6. FISD Ratings (S&P) β€” pulled/fisd_ratings_sp.parquet#

Source script: pull_fisd.py | WRDS table: fisd.fisd_ratings (filtered to rating_type = 'SPR')

Historical S&P credit ratings for each bond issue. One row per rating event (initial assignment, upgrade, downgrade).

Key columns:

Column

Description

issue_id

FISD issue identifier

rating_date

Date of the rating action

rating

S&P rating string (e.g. AAA, BBB+)

ratings_sp = pl.scan_parquet(PULL_DIR / "fisd_ratings_sp.parquet")
n_rows_sp = ratings_sp.select(pl.len()).collect().item()
cols_sp = ratings_sp.collect_schema().names()
print(f"Rows: {n_rows_sp:,}  |  Columns: {len(cols_sp)}")
print(f"Column names: {cols_sp}")
Rows: 1,185,202  |  Columns: 3
Column names: ['issue_id', 'rating_date', 'rating']

Example rows#

ratings_sp.head(5).collect()
shape: (5, 3)
issue_idrating_daterating
f64strstr
1.0"1995-02-22""BBB-"
1.0"1997-02-07""BBB-"
1.0"1997-07-03""BBB"
1.0"2001-09-21""BBB"
2.0"1995-02-22""BBB-"

Date range#

ratings_sp.select(
    pl.col("rating_date").min().alias("earliest"),
    pl.col("rating_date").max().alias("latest"),
).collect()
shape: (1, 2)
earliestlatest
strstr
"1895-10-01""2091-06-14"

7. FISD Redemption β€” pulled/fisd_redemption.parquet#

Source script: pull_fisd.py | WRDS table: fisd.fisd_mergedredemption

Callable/redeemable flags from the FISD redemption table. Indicates whether a bond issue is callable.

Key columns:

Column

Description

issue_id

FISD issue identifier

callable

Flag indicating whether the bond is callable

redemp = pl.scan_parquet(PULL_DIR / "fisd_redemption.parquet")
n_rows_redemp = redemp.select(pl.len()).collect().item()
cols_redemp = redemp.collect_schema().names()
print(f"Rows: {n_rows_redemp:,}  |  Columns: {len(cols_redemp)}")
print(f"Column names: {cols_redemp}")
Rows: 431,041  |  Columns: 2
Column names: ['issue_id', 'callable']

Example rows#

redemp.head(5).collect()
shape: (5, 2)
issue_idcallable
f64str
4.0"Y"
5.0"N"
6.0"N"
8.0"N"
9.0"N"

8. Liu-Wu Treasury Yields β€” pulled/liu_wu_yields.parquet#

Source script: pull_liu_wu_yields.py | Source: Liu-Wu (Google Sheets)

Zero-coupon Treasury yield curve at selected maturities, daily-resampled with forward fill. Yields are expressed as decimals (e.g. 0.035 = 3.5%). Used by Stage 1 to compute credit spreads.

Key columns:

Column

Description

trd_exctn_dt

Date

oneyr

1-year zero-coupon yield

twoyr

2-year zero-coupon yield

fiveyr

5-year zero-coupon yield

sevyr

7-year zero-coupon yield

tenyr

10-year zero-coupon yield

twentyr

20-year zero-coupon yield

thirtyr

30-year zero-coupon yield

liu_wu = pl.scan_parquet(PULL_DIR / "liu_wu_yields.parquet")
n_rows_lw = liu_wu.select(pl.len()).collect().item()
cols_lw = liu_wu.collect_schema().names()
print(f"Rows: {n_rows_lw:,}  |  Columns: {len(cols_lw)}")
print(f"Column names: {cols_lw}")
Rows: 9,102  |  Columns: 8
Column names: ['trd_exctn_dt', 'oneyr', 'twoyr', 'fiveyr', 'sevyr', 'tenyr', 'twentyr', 'thirtyr']

Example rows#

liu_wu.head(5).collect()
shape: (5, 8)
trd_exctn_dtoneyrtwoyrfiveyrsevyrtenyrtwentyrthirtyr
datetime[ns]f64f64f64f64f64f64f64
2000-01-31 00:00:000.0623270.0653360.0667210.0663680.0660830.0656920.062096
2000-02-01 00:00:000.0624360.0651380.0667010.0659910.065650.0649330.061348
2000-02-02 00:00:000.0622050.0652130.0664710.0656980.0651730.0640040.059505
2000-02-03 00:00:000.0617710.0646410.0654220.0647950.0641320.0631260.057421
2000-02-04 00:00:000.0624660.0656320.0667510.0658780.0651240.0640410.058841

Date range#

liu_wu.select(
    pl.col("trd_exctn_dt").min().alias("earliest"),
    pl.col("trd_exctn_dt").max().alias("latest"),
).collect()
shape: (1, 2)
earliestlatest
datetime[ns]datetime[ns]
2000-01-31 00:00:002024-12-31 00:00:00

9. OSBAP Corporate Bond Returns β€” pulled/corporate_bond_returns.parquet#

Source script: pull_open_source_bond.py

Monthly bond-level returns from the Open Source Bond Asset Pricing (OSBAP) project at openbondassetpricing.com. Used only for validation: compared against Stage 1 output in test_stage1_vs_open_source.py to benchmark duration, convexity, credit spreads, and other metrics. Not consumed by any processing stage.

Key columns (partial β€” see OSBAP documentation for full schema):

Column

Description

cusip

9-character CUSIP

date

Month-end date

bond_ret

Total bond return

bond_prc

Bond price

mod_dur

Modified duration

convexity

Convexity

credit_spread

Credit spread (OAS)

corp = pl.scan_parquet(PULL_DIR / "corporate_bond_returns.parquet")
n_rows_corp = corp.select(pl.len()).collect().item()
cols_corp = corp.collect_schema().names()
print(f"Rows: {n_rows_corp:,}  |  Columns: {len(cols_corp)}")
print(f"Column names: {cols_corp}")
Rows: 1,859,546  |  Columns: 140
Column names: ['cusip', 'date', 'issuer_cusip', 'permno', 'permco', 'gvkey', '144a', 'country', 'call', 'ret_vw', 'ret_vw_bgn', 'hprd', 'lib', 'libd', 'ret_type', 'spc_rat', 'mdc_rat', 'ff17num', 'ff30num', 'fce_val', 'mcap_s', 'mcap_e', 'tret', 'rfret', 'dt_s', 'dt_e', 'dt_s_bgn', 'dt_e_bgn', 'hprd_bgn', 'igap_bgn', 'sig_dt', 'sig_gap', 'tmat', 'age', 'ytm', 'cs', 'md_dur', 'convx', 'bbtm', 'sze', 'val_hz', 'val_hz_dts', 'val_ipr', 'val_ipr_dts', 'dcs6', 'cs_mu12_1', 'pi', 'ami', 'ami_v', 'lix', 'ilq', 'roll', 'spd_abs', 'spd_rel', 'cs_sprd', 'ar_sprd', 'p_zro', 'p_fht', 'vov', 'dvol', 'dskew', 'dkurt', 'db_mkt', 'dvol_sys', 'dvol_idio', 'rvol', 'rsj', 'rsk', 'rkt', 'b_vix', 'b_dvixd', 'b_mktrf_mkt', 'b_mktb_mkt', 'ivol_mkt', 'ivol_bbw', 'b_mktbx_dcapm', 'b_term_dcapm', 'b_dvix_va', 'b_dvix_vp', 'ivol_vp', 'b_psb_m', 'b_amd_m', 'b_dvix', 'b_cpi_vol6', 'b_dunc', 'b_unc', 'b_dunc3', 'b_dunc6', 'b_duncr', 'b_duncf', 'b_dcredit', 'b_credit', 'b_dcpi', 'b_cptlt', 'b_rvol', 'b_rsj', 'b_psb', 'b_amd', 'b_illiq', 'b_dvix_dn', 'b_dvix_up', 'b_coskew', 'iskew', 'b_defb', 'b_termb', 'b_drf', 'b_crf', 'b_lrf', 'b_mktb', 'b_lvl', 'b_ysp', 'b_mktb_dn', 'b_mktb_up', 'b_epu', 'b_epum', 'b_eput', 'sysmom3_1', 'sysmom6_1', 'sysmom12_1', 'idimom3_1', 'idimom6_1', 'idimom12_1', 'mom3_1', 'mom6_1', 'mom9_1', 'mom12_1', 'mom12_7', 'ltr48_12', 'ltr30_6', 'ltr24_3', 'imom1', 'imom3_1', 'imom12_1', 'iltr48_12', 'iltr30_6', 'iltr24_3', 'var_90', 'es_90', 'var_95', 'str']

Example rows#

corp.head(5).collect()
shape: (5, 140)
cusipdateissuer_cusippermnopermcogvkey144acountrycallret_vwret_vw_bgnhprdliblibdret_typespc_ratmdc_ratff17numff30numfce_valmcap_smcap_etretrfretdt_sdt_edt_s_bgndt_e_bgnhprd_bgnigap_bgnsig_dtsig_gaptmatageytmcsmd_durb_defbb_termbb_drfb_crfb_lrfb_mktbb_lvlb_yspb_mktb_dnb_mktb_upb_epub_epumb_eputsysmom3_1sysmom6_1sysmom12_1idimom3_1idimom6_1idimom12_1mom3_1mom6_1mom9_1mom12_1mom12_7ltr48_12ltr30_6ltr24_3imom1imom3_1imom12_1iltr48_12iltr30_6iltr24_3var_90es_90var_95str
catdatetime[ns]catf64f64f64i8cati8f32f32i16f32f32cati32i32i8i8i64f32f32f32f64datetime[ns]datetime[ns]datetime[ns]datetime[ns]i16i8datetime[ns]f64f32f32f32f32f32f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f32
"000336AE7"2002-08-31 00:00:00"000336"75188.0nullnull0"USA"10.007252null260.0068790.006951"standard"11162910000098.71494399.4308470.0151090.00142002-07-26 00:00:002002-08-30 00:00:00nullnullnullnull2002-08-28 00:00:002.05.7554.2600960.0643160.0283954.6360.2257940.512650.188553-0.285256null0.725394-1.1803180.2234420.7393710.7143150.1416840.0629320.275439-0.011314-0.027213-0.0328320.0271590.0021560.0123630.015825-0.025425-0.016064-0.021164-0.005107-0.025394-0.025394-0.0253940.0090890.0167140.0700780.2599470.250910.2028240.0321740.0352030.0382330.051041
"000336AE7"2002-09-30 00:00:00"000336"75188.0nullnull0"USA"1-0.05466-0.06118520nullnull"standard"11162910000099.43084793.9959560.0264860.00142002-08-30 00:00:002002-09-30 00:00:002002-09-03 00:00:002002-09-30 00:00:001912002-09-23 00:00:005.05.674.3449690.0753010.0450194.5150.4690640.1522570.2164050.198128null0.604760.16412-0.5351160.2592190.886356-0.069745-0.0381210.1476580.006335-0.02346-0.0398960.0179420.0076890.0369770.0244-0.016358-0.013613-0.0038060.014823-0.025394-0.026544-0.0265440.0157360.0249680.0752540.2633040.2136760.1968030.0382330.0464460.05466-0.002936
"000336AE7"2002-10-31 00:00:00"000336"75188.0nullnull0"USA"10.051999null230.0156250.015362"standard"11162910000093.99595698.883682-0.0023110.00142002-09-30 00:00:002002-10-31 00:00:00nullnullnullnull2002-10-25 00:00:004.05.5854.4298420.0742740.0408764.4290.527977-0.0077710.3413650.338517null0.509735-0.5345770.3861130.40560.593340.0401340.0025590.2676050.0189750.006474-0.006987-0.066383-0.044732-0.043342-0.047804-0.039212-0.063172-0.0513810.020156-0.026544-0.009971-0.0099710.0142740.0302340.0676410.2959790.2373940.206790.0382330.0464460.054660.066941
"000336AE7"2002-11-30 00:00:00"000336"75188.0nullnull0"USA"10.0805570.06420821nullnull"standard"11162910000098.883682103.431-0.0152730.00122002-10-31 00:00:002002-11-27 00:00:002002-11-01 00:00:002002-11-27 00:00:001812002-11-26 00:00:001.05.5114.5037650.0687480.0352594.3750.990068-0.3552570.5855840.646973null0.926985-1.1500491.6564770.2619131.3876020.1212450.0617120.4099040.0074090.0300240.015147-0.01007-0.009588-0.025985-0.0055030.017561-0.023759-0.014382-0.024863-0.009971-0.009971-0.0027910.0012630.0155550.0756890.2606120.2622130.2073620.0382330.0464460.054660.04528
"000336AE7"2003-04-30 00:00:00"000336"75188.0nullnull0"USA"10.0678990.02680221nullnull"standard"11162910000088.45065394.456390.0017010.0012003-03-31 00:00:002003-04-29 00:00:002003-04-01 00:00:002003-04-29 00:00:001912003-04-28 00:00:001.05.0924.9226560.092880.0631544.0191.545703-0.7479950.8577281.005559null1.419267-2.7399214.1003571.2174541.5465060.7337720.2032130.7646710.078630.1615760.1888280.1967340.118380.0755990.2910570.2932640.3049340.272392-0.032609-0.002791-0.057299-0.0082790.0054710.0184390.1206620.2570040.2448470.1969720.0382330.0464460.054660.05373

Date range#

date_cols_corp = [c for c in cols_corp if "date" in c.lower() or "dt" in c.lower()]
if date_cols_corp:
    corp.select(
        pl.col(date_cols_corp[0]).min().alias("earliest"),
        pl.col(date_cols_corp[0]).max().alias("latest"),
    ).collect()

10. OSBAP Linker (Fang, 2025) β€” pulled/osbap_linker.parquet#

Source script: pull_open_source_bond.py | Source: openbondassetpricing.com

Core Stage 1 input. Bond-firm link file constructed by Chuck Fang (cite: Fang, 2025, β€œMonetary Policy Amplification through Bond Fund Flows”, working paper). Maps issuer CUSIPs to equity identifiers (PERMNO, PERMCO, GVKEY) in real time.

Many bonds are issued through subsidiaries that do not share the parent’s identifiers (e.g. Shell plc vs Shell International Finance BV), and 19% of CUSIP6s change ultimate parents over time. Fang constructs the link using the CUSIP-ticker mapping from ICE and TRACE, the ticker-GVKEY mapping from Compustat Snapshot, and the CRSP-Compustat Link from CRSP, with extensive hand corrections.

Coverage: 93% (99%) of USD corporate debentures by amount outstanding (trading volume), including 29,253 foreign/private issuers and 8,862 144A bonds not covered by the WRDS Bond-CRSP Link.

Used in Step 7 of stage1_pipeline.py: forward-filled and left-joined on issuer CUSIP + year-month to attach equity IDs to every bond record.

linker = pl.scan_parquet(PULL_DIR / "osbap_linker.parquet")
n_rows_linker = linker.select(pl.len()).collect().item()
cols_linker = linker.collect_schema().names()
print(f"Rows: {n_rows_linker:,}  |  Columns: {len(cols_linker)}")
print(f"Column names: {cols_linker}")
Rows: 21,845,380  |  Columns: 5
Column names: ['ISSUER_CUSIP', 'GVKEY', 'PERMNO', 'PERMCO', 'YYYYMM']

Example rows#

linker.head(5).collect()
shape: (5, 5)
ISSUER_CUSIPGVKEYPERMNOPERMCOYYYYMM
strf64f64f64i64
"000021"23052.078736.029189.0199701
"000021"23052.078736.029189.0199702
"000021"23052.078736.029189.0199703
"000021"23052.078736.029189.0199704
"000021"23052.078736.029189.0199705

11. OSBAP Treasury Bond Returns β€” pulled/treasury_bond_returns.parquet#

Source script: pull_open_source_bond.py

Treasury bond returns from the OSBAP project. Currently documentation only β€” not consumed by any processing or validation stage in the pipeline.

treas = pl.scan_parquet(PULL_DIR / "treasury_bond_returns.parquet")
n_rows_treas = treas.select(pl.len()).collect().item()
cols_treas = treas.collect_schema().names()
print(f"Rows: {n_rows_treas:,}  |  Columns: {len(cols_treas)}")
print(f"Column names: {cols_treas}")
Rows: 2,381,340  |  Columns: 5
Column names: ['DATE', 'CUSIP', 'tr_return', 'tr_ytm_match', 'tau']

Example rows#

treas.head(5).collect()
shape: (5, 5)
DATECUSIPtr_returntr_ytm_matchtau
i64strf64f64f64
20020731"000325AA8"nullnull0.545205
20020831"000325AA8"0.09257null0.460274
20021031"000325AA8"nullnull0.293151
20021130"000325AA8"0.215493null0.210959
20021231"000325AA8"0.109617null0.126027

Date range#

date_cols_treas = [c for c in cols_treas if "date" in c.lower() or "dt" in c.lower()]
if date_cols_treas:
    treas.select(
        pl.col(date_cols_treas[0]).min().alias("earliest"),
        pl.col(date_cols_treas[0]).max().alias("latest"),
    ).collect()

12. TRACE 144A β€” pulled/trace_144a/#

Source script: pull_trace_144a.py | WRDS table: trace.trace_btds144a

Trade-level data for Rule 144A private placement bonds. These are bonds sold to qualified institutional buyers under SEC Rule 144A. Data starts July 2002. Same column schema as TRACE Standard.

Key columns:

Column

Description

cusip_id

9-character CUSIP bond identifier

trd_exctn_dt

Trade execution date

trd_exctn_tm

Trade execution time

rptd_pr

Reported price (per $100 par)

ascii_rptd_vol_tx

Reported volume (text field)

yld_pt

Yield at time of trade

trc_st

TRACE status

side

Trade side

msg_seq_nb

Message sequence number

trace_144a_dir = PULL_DIR / "trace_144a"
trace_144a = pl.scan_parquet(
    trace_144a_dir / "**/*.parquet",
    hive_partitioning=True,
    allow_missing_columns=True,
)
n_rows_144a = trace_144a.select(pl.len()).collect().item()
cols_144a = trace_144a.collect_schema().names()
print(f"Rows: {n_rows_144a:,}  |  Columns: {len(cols_144a)}")
print(f"Column names: {cols_144a}")
Rows: 20,357,504  |  Columns: 21
Column names: ['cusip_id', 'bond_sym_id', 'bsym', 'trd_exctn_dt', 'trd_exctn_tm', 'msg_seq_nb', 'trc_st', 'wis_fl', 'cmsn_trd', 'ascii_rptd_vol_tx', 'rptd_pr', 'yld_pt', 'asof_cd', 'side', 'diss_rptg_side_cd', 'orig_msg_seq_nb', 'orig_dis_dt', 'rptg_party_type', 'contra_party_type', 'year', 'month']
/var/folders/l3/tj6vb0ld2ys1h939jz0qrfrh0000gn/T/ipykernel_72101/1672643635.py:2: DeprecationWarning: the parameter `allow_missing_columns` for `scan_parquet` is deprecated. Use the parameter `missing_columns` instead and pass one of `('insert', 'raise')`.
(Deprecated in version 1.30.0)
  trace_144a = pl.scan_parquet(

Example rows#

trace_144a.head(5).collect()
shape: (5, 21)
cusip_idbond_sym_idbsymtrd_exctn_dttrd_exctn_tmmsg_seq_nbtrc_stwis_flcmsn_trdascii_rptd_vol_txrptd_pryld_ptasof_cdsidediss_rptg_side_cdorig_msg_seq_nborig_dis_dtrptg_party_typecontra_party_typeyearmonth
strstrstrdatetimef64strstrstrstrf64f64strstrstrf64datestrstri64i64
"053773BE6""CAR5025361""BBG00WGJGS47"2003-08-0318:13:033463.0"N"nullnull"1MM+"93.07.041207"A""B"null6785.02020-08-06"D""C"20038
"053773BE6""CAR5025361""BBG00WGJGS47"2003-08-0318:13:036785.0"M"nullnull"1MM+"93.07.041207"A""B"nullnullnull"D""C"20038
"045086AK1""ASWD4936471""BBG00RCZ3WB3"2003-10-1711:58:443303.0"M"nullnull"1MM+"92.758.701616"A""S"nullnullnull"D""C"200310
"423074AS2""BRK4206318""BBG0080F6Q77"2004-01-2809:04:15185.0"G"null"N""1000000"100.2281254.839958"A"null"B"nullnullnullnull20041
"225433AQ4""CS4353757""BBG00CNXL6V4"2004-10-0109:37:29462.0"M"null"N""5000000"100.415null"A""B"nullnullnull"D""C"200410

Date range#

trace_144a.select(
    pl.col("trd_exctn_dt").min().alias("earliest"),
    pl.col("trd_exctn_dt").max().alias("latest"),
).collect()
shape: (1, 2)
earliestlatest
datedate
2003-08-032026-02-13

Trades per month β€” TRACE 144A#

import matplotlib.dates as mdates

t144a_monthly = (
    trace_144a.with_columns(pl.col("trd_exctn_dt").alias("date"))
    .group_by(pl.col("date").dt.truncate("1mo"))
    .agg(pl.len().alias("n_trades"))
    .sort("date")
    .collect()
)

fig, ax = plt.subplots(figsize=(10, 3))
ax.bar(
    t144a_monthly["date"].to_list(),
    t144a_monthly["n_trades"].to_list(),
    width=25,
    color="darkorange",
    alpha=0.8,
)
ax.xaxis.set_major_locator(mdates.YearLocator(2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
ax.set_ylabel("Trades")
ax.set_title("TRACE 144A β€” trades per month")
fig.tight_layout()
plt.show()
../_images/84ec422b883c1ef3ffcffbe54ad631a3448efaa4dd3a4f37629b2709145d2d87.png

13. TRACE Enhanced β€” pulled/trace_enhanced/#

Source script: pull_trace_enhanced.py | WRDS table: trace.trace_enhanced

Trade-level corporate bond transaction data from FINRA’s TRACE system (Enhanced version). Data starts July 2002 and is stored as a hive-partitioned data lake (trace_enhanced/year=YYYY/month=MM/data.parquet). Each row is one reported trade with price, volume, and counterparty details.

Key columns:

Column

Description

cusip_id

9-character CUSIP bond identifier

trd_exctn_dt

Trade execution date

trd_exctn_tm

Trade execution time

rptd_pr

Reported price (per $100 par)

entrd_vol_qt

Entered volume (par value)

yld_pt

Yield at time of trade

trc_st

TRACE status (T, X, C, W, Y)

msg_seq_nb

Message sequence number

rpt_side_cd

Reporting side (B=buy, S=sell)

cntra_mp_id

Counterparty market participant ID

trace_enh_dir = PULL_DIR / "trace_enhanced"
trace_enh = pl.scan_parquet(
    trace_enh_dir / "**/*.parquet",
    hive_partitioning=True,
    allow_missing_columns=True,
)
n_rows_enh = trace_enh.select(pl.len()).collect().item()
cols_enh = trace_enh.collect_schema().names()
print(f"Rows: {n_rows_enh:,}  |  Columns: {len(cols_enh)}")
print(f"Column names: {cols_enh}")
Rows: 423,792,457  |  Columns: 21
Column names: ['cusip_id', 'bond_sym_id', 'trd_exctn_dt', 'trd_exctn_tm', 'days_to_sttl_ct', 'lckd_in_ind', 'wis_fl', 'sale_cndtn_cd', 'msg_seq_nb', 'trc_st', 'trd_rpt_dt', 'trd_rpt_tm', 'entrd_vol_qt', 'rptd_pr', 'yld_pt', 'asof_cd', 'orig_msg_seq_nb', 'rpt_side_cd', 'cntra_mp_id', 'year', 'month']
/var/folders/l3/tj6vb0ld2ys1h939jz0qrfrh0000gn/T/ipykernel_72101/3951940352.py:2: DeprecationWarning: the parameter `allow_missing_columns` for `scan_parquet` is deprecated. Use the parameter `missing_columns` instead and pass one of `('insert', 'raise')`.
(Deprecated in version 1.30.0)
  trace_enh = pl.scan_parquet(

Example rows (first partition)#

trace_enh.head(5).collect()
shape: (5, 21)
cusip_idbond_sym_idtrd_exctn_dttrd_exctn_tmdays_to_sttl_ctlckd_in_indwis_flsale_cndtn_cdmsg_seq_nbtrc_sttrd_rpt_dttrd_rpt_tmentrd_vol_qtrptd_pryld_ptasof_cdorig_msg_seq_nbrpt_side_cdcntra_mp_idyearmonth
strstrdatei64strstrstrstrstrstrdatei64f64f64f64strstrstrstri64i64
"000325AA8""AAFM.GA"2002-07-0149893000000000"000"null"N""@""0009218""T"2002-07-01498950000000001e6118.2517.965641nullnull"B""D"20027
"000325AA8""AAFM.GA"2002-07-0149916000000000"000"null"N""@""0009239""T"2002-07-01499170000000001e6118.2517.965641nullnull"S""D"20027
"000325AA8""AAFM.GA"2002-07-0149916000000000nullnull"N""@""0009267""C"2002-07-01499170000000001e6118.2517.965641null"0009239""S""D"20027
"000325AA8""AAFM.GA"2002-07-0164233000000000"000"null"N""A""0020341""T"2002-07-01642370000000001e6118.2517.965641nullnull"B""D"20027
"000325AA8""AAFM.GA"2002-07-0164277000000000nullnull"N""A""0020358""C"2002-07-01642790000000002e6118.2517.965641null"0020359""S""D"20027

Date range#

trace_enh.select(
    pl.col("trd_exctn_dt").min().alias("earliest"),
    pl.col("trd_exctn_dt").max().alias("latest"),
).collect()
shape: (1, 2)
earliestlatest
datedate
2002-07-012025-03-31

Summary statistics β€” price & volume#

trace_enh.select(
    pl.col("rptd_pr").mean().alias("mean_price"),
    pl.col("rptd_pr").std().alias("std_price"),
    pl.col("rptd_pr").median().alias("median_price"),
    pl.col("entrd_vol_qt").mean().alias("mean_volume"),
    pl.col("entrd_vol_qt").median().alias("median_volume"),
).collect()
shape: (1, 5)
mean_pricestd_pricemedian_pricemean_volumemedian_volume
f64f64f64f64f64
97.82455136.784462100.0654531.8569825000.0

Trades per month#

enh_monthly = (
    trace_enh.with_columns(pl.col("trd_exctn_dt").alias("date"))
    .group_by(pl.col("date").dt.truncate("1mo"))
    .agg(pl.len().alias("n_trades"))
    .sort("date")
    .collect()
)

fig, ax = plt.subplots(figsize=(10, 3))
ax.bar(
    enh_monthly["date"].to_list(),
    enh_monthly["n_trades"].to_list(),
    width=25,
    color="steelblue",
    alpha=0.8,
)
ax.xaxis.set_major_locator(mdates.YearLocator(2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
ax.set_ylabel("Trades")
ax.set_title("TRACE Enhanced β€” trades per month")
fig.tight_layout()
plt.show()
../_images/f4f5942191356dc894091f2f5da5154c739e8f027560b2b3f7b94b958912645a.png

14. TRACE Standard β€” pulled/trace_standard/#

Source script: pull_trace_standard.py | WRDS table: trace.trace

Trade-level corporate bond transaction data from the Standard TRACE feed. Available from October 2024 onward. Same hive-partitioned storage as Enhanced. Uses a slightly different column set (e.g. ascii_rptd_vol_tx instead of entrd_vol_qt, side instead of rpt_side_cd).

Key columns:

Column

Description

cusip_id

9-character CUSIP bond identifier

trd_exctn_dt

Trade execution date

trd_exctn_tm

Trade execution time

rptd_pr

Reported price (per $100 par)

ascii_rptd_vol_tx

Reported volume (text field)

yld_pt

Yield at time of trade

trc_st

TRACE status

side

Trade side

msg_seq_nb

Message sequence number

trace_std_dir = PULL_DIR / "trace_standard"
trace_std = pl.scan_parquet(
    trace_std_dir / "**/*.parquet",
    hive_partitioning=True,
    allow_missing_columns=True,
)
n_rows_std = trace_std.select(pl.len()).collect().item()
cols_std = trace_std.collect_schema().names()
print(f"Rows: {n_rows_std:,}  |  Columns: {len(cols_std)}")
print(f"Column names: {cols_std}")
Rows: 47,256,465  |  Columns: 21
Column names: ['cusip_id', 'bond_sym_id', 'bsym', 'trd_exctn_dt', 'trd_exctn_tm', 'msg_seq_nb', 'trc_st', 'wis_fl', 'cmsn_trd', 'ascii_rptd_vol_tx', 'rptd_pr', 'yld_pt', 'asof_cd', 'side', 'diss_rptg_side_cd', 'orig_msg_seq_nb', 'orig_dis_dt', 'rptg_party_type', 'contra_party_type', 'year', 'month']
/var/folders/l3/tj6vb0ld2ys1h939jz0qrfrh0000gn/T/ipykernel_72101/3284768942.py:2: DeprecationWarning: the parameter `allow_missing_columns` for `scan_parquet` is deprecated. Use the parameter `missing_columns` instead and pass one of `('insert', 'raise')`.
(Deprecated in version 1.30.0)
  trace_std = pl.scan_parquet(

Example rows#

trace_std.head(5).collect()
shape: (5, 21)
cusip_idbond_sym_idbsymtrd_exctn_dttrd_exctn_tmmsg_seq_nbtrc_stwis_flcmsn_trdascii_rptd_vol_txrptd_pryld_ptasof_cdsidediss_rptg_side_cdorig_msg_seq_nborig_dis_dtrptg_party_typecontra_party_typeyearmonth
strstrstrdatei64f64strstrstrstrf64f64strstrstrf64datestrstri64i64
"02209SBD4""MO4797929""BBG00NB2Z4S5"2024-01-0178261000000000327.0"M"nullnull"3000"100.0494.787982"A""S"nullnullnull"D""D"20241
"254687FS0""DIS4969022""BBG00SMMG454"2024-01-0181521000000000322.0"M"nullnull"5000"97.4484.872966"A""S"nullnullnull"D""D"20241
"30303M8J4""FB5522214""BBG01BV23038"2024-01-0181331000000000330.0"M"nullnull"10000"92.5314.940009"A""S"nullnullnull"D""D"20241
"404280DH9""HBC5458071""BBG0192W27G1"2024-01-018122500000000026.0"M"nullnull"2000000"101.0null"A""S"nullnullnull"D""A"20241
"458140BX7""INTC5238071""BBG01236J676"2024-01-0181367000000000337.0"M"nullnull"10000"69.7074.993"A""S"nullnullnull"D""D"20241

Date range#

trace_std.select(
    pl.col("trd_exctn_dt").min().alias("earliest"),
    pl.col("trd_exctn_dt").max().alias("latest"),
).collect()
shape: (1, 2)
earliestlatest
datedate
2024-01-012026-02-13