CRSP Treasury Data Oveview#

The CRSP US Treasury Database is a comprehensive source of historical Treasury security data maintained by the Center for Research in Security Prices at the University of Chicago’s Booth School of Business. This section provides an essential overview of this critical dataset, which will serve as the foundation for our subsequent lecture on replicating the Gürkaynak, Sack, and Wright (2006) yield curve estimation methodology.

Database Description#

The CRSP US Treasury Database provides complete historical descriptive information and market data for US Treasury securities, including:

  • Prices

  • Returns

  • Accrued interest

  • Yields

  • Durations

The database begins in 1961 and is updated monthly. The data is available in ASCII, Excel, and SAS formats.

Let’s examine a sample of the data using Python. First, we’ll import the necessary modules:

import pull_CRSP_treasury
import pandas as pd

from settings import config
DATA_DIR = config("DATA_DIR")

# Load the consolidated CRSP Treasury data
df = pull_CRSP_treasury.load_CRSP_treasury_consolidated(data_dir=DATA_DIR)

# Display the first few rows and data info
df[['caldt', 'tcusip', 'price', 'tcouprt', 'tdyld']].head()
caldt tcusip price tcouprt tdyld
0 1970-01-02 912810AE 101.052989 4.0 0.000212
1 1970-01-05 912810AE 101.148098 4.0 0.000205
2 1970-01-06 912810AE 101.190217 4.0 0.000199
3 1970-01-07 912810AE 101.201087 4.0 0.000202
4 1970-01-08 912810AE 101.211957 4.0 0.000204
# Data structure information:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444137 entries, 0 to 2444136
Data columns (total 21 columns):
 #   Column             Dtype         
---  ------             -----         
 0   kytreasno          float64       
 1   kycrspid           object        
 2   tcusip             object        
 3   caldt              datetime64[ns]
 4   tdatdt             datetime64[ns]
 5   tmatdt             datetime64[ns]
 6   tfcaldt            int64         
 7   tdbid              float64       
 8   tdask              float64       
 9   tdaccint           float64       
 10  tdyld              float64       
 11  price              float64       
 12  tcouprt            float64       
 13  itype              float64       
 14  original_maturity  float64       
 15  years_to_maturity  float64       
 16  tdduratn           float64       
 17  tdretnua           float64       
 18  days_to_maturity   int64         
 19  callable           bool          
 20  run                int64         
dtypes: bool(1), datetime64[ns](3), float64(12), int64(3), object(2)
memory usage: 375.3+ MB

In the consolidated dataframe that I have constructed from the data in pull_CRSP_treasury.py, the following variables are present:

Identification:

  • kytreasno: Treasury record identifier (unique numeric ID)

  • kycrspid: CRSP-assigned unique identifier

  • tcusip: Treasury CUSIP number (unique security identifier)

Dates:

  • caldt: Quote date (date of price observation)

  • tdatdt: Date dated (original issue date when interest starts accruing)

  • tmatdt: Maturity date (when principal is repaid)

  • tfcaldt: First call date (0 if not callable)

Prices and Yields:

  • tdbid: Daily bid price (clean price)

  • tdask: Daily ask price (clean price)

  • tdaccint: Daily accrued interest since last coupon

  • tdyld: Daily yield (bond equivalent yield)

  • price: Dirty price (clean price + accrued interest)

Issue Characteristics:

  • tcouprt: Coupon rate (annual)

  • itype: Type of issue (1: bonds, 2: notes)

  • callable: Boolean flag indicating if security is callable

Maturity Measures:

  • original_maturity: Years to maturity at issuance

  • years_to_maturity: Remaining years to maturity

  • days_to_maturity: Remaining days to maturity

Trading Information:

  • tdduratn: Duration (price sensitivity to yield changes)

  • tdretnua: Unadjusted return (price change + accrued interest)

Market Structure:

  • run: Security’s “runness” (0: off-the-run, 1: first off-the-run, 2: on-the-run)

Data Sources#

The price data in CRSP comes from several sources over time:

Note

Historical Data Sources:

  • 1961-1989: Manual input with double-entry verification

  • 1990-1996 (Sept): Department of Commerce electronic bulletin board

  • 1996 (Oct)-2009 (Jan): GovPX, Inc.

  • 2009 (Feb)-Present: ICAP

Supplementary data comes from various official sources:

  • Amount outstanding: Monthly Statement of the Public Debt (Treasury Department)

  • Publicly held debt: US Treasury Bulletin (quarterly)

  • Money rates: Federal Reserve

  • Issue characteristics (dates, coupon payments, eligibility, tax status): US Treasury Department

Database Structure#

The database consists of three primary files:

  1. Calendar File

    • Contains daily quote dates and delivery dates

    • Includes derived date information

  2. Master File

    • Contains security identification and characteristics

    • Includes header information for each security

  3. Cross-Sectional Files

    • Contains time series of prices, yields, and returns

    • Organized by date

Reference:
    CRSP US TREASURY DATABASE GUIDE
    https://www.crsp.org/wp-content/uploads/guides/CRSP_US_Treasury_Database_Guide_for_SAS_ASCII_EXCEL_R.pdf

Data Description:
    TFZ_DLY ( DAILY TIME SERIES ITEMS)
        kytreasno: TREASURY RECORD IDENTIFIER
        kycrspid: CRSP-ASSIGNED UNIQUE ID
        caldt: QUOTATION DATE
        tdbid: DAILY BID
        tdask: DAILY ASK
        tdaccint: DAILY SERIES OF TOTAL ACCRUED INTEREST
        tdyld: DAILY SERIES OF PROMISED DAILY YIELD

    TFZ_ISS (ISSUE DESCRIPTIONS)
        tcusip: TREASURY CUSIP
        tdatdt: DATE DATED BY TREASURY
        tmatdt: MATURITY DATE AT TIME OF ISSUE
        tcouprt: COUPON RATE
        itype: TYPE OF ISSUE (1: NONCALLABLE BONDS, 2: NONCALLABLE NOTES)

See for example, this function in pull_CRSP_treasury.py:

def pull_CRSP_treasury_consolidated(
    start_date="1970-01-01",
    end_date=datetime.today().strftime("%Y-%m-%d"),
    wrds_username=WRDS_USERNAME,
):
    """Pull consolidated CRSP Treasury data with all relevant fields.

    Includes fields from these tables:
    - tfz_dly (daily quotes): bid/ask prices, accrued interest, yields
    - tfz_iss (issue info): CUSIP, dates, coupon rates, issue types

    Price Terminology:
    - Clean Price = (bid + ask)/2 = quoted price without accrued interest
    - Dirty Price = Clean Price + Accrued Interest = actual transaction price

    Date Fields:
    - Quote Date (caldt): Date of the price observation
    - Date Dated (tdatdt): Original issue date when interest starts accruing
    - Maturity Date (tmatdt): Date when the security matures

    Returns:
    - Unadjusted Return (tdretnua): Simple price change plus accrued interest,
      not accounting for tax effects or reinvestment

    Fields are based on CRSP Daily US Treasury Database Guide specifications.
    """

    query = f"""
    SELECT 
        -- Identification
        tfz.kytreasno, tfz.kycrspid, iss.tcusip,
        
        -- Dates
        tfz.caldt,                  -- Quote date: Date of price observation
        iss.tdatdt,                 -- Date dated: Original issue date when interest starts accruing
        iss.tmatdt,                 -- Maturity date: When principal is repaid
        iss.tfcaldt,                -- First call date (0 if not callable)
        
        -- Prices and Yields
        tfz.tdbid,                  -- Bid price (clean)
        tfz.tdask,                  -- Ask price (clean)
        tfz.tdaccint,               -- Accrued interest since last coupon
        tfz.tdyld,                  -- Bond equivalent yield
        ((tfz.tdbid + tfz.tdask) / 2.0 + tfz.tdaccint) AS price,  
                                    -- Dirty price (clean price + accrued interest)
        
        -- Issue Characteristics
        iss.tcouprt,                -- Coupon rate (annual)
        iss.itype,                  -- Type of issue (1: bonds, 2: notes)
        ROUND((iss.tmatdt - iss.tdatdt) / 365.0) AS original_maturity,  
                                    -- Original maturity at issuance
        
        -- Additional Derived Fields
        ROUND((iss.tmatdt - tfz.caldt) / 365.0) AS years_to_maturity,  
                                    -- Remaining time to maturity
        
        -- Trading info (if available)
        tfz.tdduratn,               -- Duration: Price sensitivity to yield changes
        tfz.tdretnua                -- Return (unadjusted): Simple price change + accrued interest
        
    FROM 
        crspm.tfz_dly AS tfz
    LEFT JOIN 
        crspm.tfz_iss AS iss 
    ON 
        tfz.kytreasno = iss.kytreasno AND 
        tfz.kycrspid = iss.kycrspid
    WHERE 
        tfz.caldt BETWEEN '{start_date}' AND '{end_date}' AND 
        iss.itype IN (1, 2)  -- Only include Treasury bonds (1) and notes (2)
    """

    db = wrds.Connection(wrds_username=wrds_username)
    df = db.raw_sql(query, date_cols=["caldt", "tdatdt", "tmatdt", "tfcaldt"])
    df["days_to_maturity"] = (df["tmatdt"] - df["caldt"]).dt.days
    df["tfcaldt"] = df["tfcaldt"].fillna(0)
    df["callable"] = df["tfcaldt"] != 0  # Add boolean callable flag
    db.close()
    df = df.reset_index(drop=True)
    return df

Data Quality and Filtering#

CRSP implements several quality control measures:

  1. Price verification:

    • Historical manual prices were double-entered and compared

    • Multiple iteration verification process

    • Logical filters for data cleaning

  2. Spread calculation:

    • Pre-2009: Imputed from representative quotes

    • Post-2009: Actual bid-ask spreads from ICAP

  3. Secondary verification:

    • Cross-checking against sources like Wall Street Journal

    • Internal consistency checks with each release

Let’s examine the bid-ask spreads in our data:

df['spread'] = df['tdask'] - df['tdbid']
print("\nBid-Ask Spread Statistics:")
df.groupby(df['caldt'].dt.year)['spread'].describe().tail()
Bid-Ask Spread Statistics:
count mean std min 25% 50% 75% max
caldt
2020 78765.0 0.048491 0.017829 0.007812 0.039062 0.046875 0.062500 2.070312
2021 80632.0 0.047548 0.013401 0.007812 0.039062 0.046875 0.062500 0.218750
2022 81005.0 0.041079 0.016491 0.007812 0.031250 0.039062 0.046875 0.328125
2023 83652.0 0.039486 0.017600 0.007812 0.031250 0.031250 0.046875 0.546875
2024 85415.0 0.039168 0.019176 0.000028 0.031250 0.031250 0.046875 0.500000
df.groupby(df['caldt'].dt.year)['spread'].describe().head()
count mean std min 25% 50% 75% max
caldt
1970 9461.0 0.293911 0.234581 0.03125 0.125 0.25 0.25 1.0
1971 9924.0 0.321119 0.279893 0.03125 0.125 0.25 0.25 1.0
1972 10442.0 0.322744 0.244351 0.03125 0.250 0.25 0.25 1.0
1973 10733.0 0.327052 0.261802 0.03125 0.250 0.25 0.25 1.0
1974 11051.0 0.387957 0.328981 0.06250 0.250 0.25 0.25 1.0

Looking Ahead#

This overview provides the foundation for our next lecture, where we will examine how GĂĽrkaynak, Sack, and Wright (2006) used CRSP Treasury data to construct their influential yield curve estimates. We will implement their methodology using Python, focusing on their specific data filtering approach and estimation techniques.

Note: The CRSP Treasury Database serves as the primary source for many influential academic studies in fixed income research. Understanding its structure and evolution is crucial for empirical work in this area.

References#

Center for Research in Security Prices. (2010). Daily US Treasury Database Guide. See https://wrds-www.wharton.upenn.edu/pages/support/manuals-and-overviews/crsp/treasuries/