Example: Connecting to the WRDS Platform With Python#
The wrds
Python package is a tool designed to facilitate data retrieval from the Wharton Research Data Services (WRDS) database.
It provides direct access to the WRDS database, allowing users to programmatically query and retrieve data.
The package supports a simple Python API as well as the ability to send raw SQL queries.
Retrieved data can be easily converted into Pandas DataFrames.
Secure access to the WRDS database is managed through user authentication, ensuring data security and compliance with usage policies.
Installation#
The package can be installed via pip:
pip install wrds
To use the package, one would typically import it in Python, authenticate with WRDS credentials, and then proceed with data queries and analysis. This notebook walks through some of these basic features.
Usage#
Here are some helpful links to learn how to use the wrds
Python package:
Import and Establish Connection#
Establish connection with WRDS server.
Log in using your WRDS username and password.
Set up a pgpass
file to store the info. This will allow you to access WRDS without supplying your password each time, as long as you supply your username (which we are doing below via environment variables). We want to be able to access WRDS without a password so that we can automate the query.
import config
from pathlib import Path
OUTPUT_DIR = Path(config.OUTPUT_DIR)
DATA_DIR = Path(config.DATA_DIR)
WRDS_USERNAME = config.WRDS_USERNAME
import wrds
db = wrds.Connection(wrds_username=WRDS_USERNAME)
Loading library list...
Done
# The `obs` keyword argument limits the number of rows returned.
# You can omit it to get all rows.
df = db.get_table(library='crsp', table='msf', columns=['cusip', 'permno', 'date', 'shrout', 'prc', 'ret', 'retx'], obs=10)
df
cusip | permno | date | shrout | prc | ret | retx | |
---|---|---|---|---|---|---|---|
0 | 68391610 | 10000.0 | 1985-12-31 | NaN | NaN | NaN | NaN |
1 | 68391610 | 10000.0 | 1986-01-31 | 3680.0 | -4.375000 | NaN | NaN |
2 | 68391610 | 10000.0 | 1986-02-28 | 3680.0 | -3.250000 | -0.257143 | -0.257143 |
3 | 68391610 | 10000.0 | 1986-03-31 | 3680.0 | -4.437500 | 0.365385 | 0.365385 |
4 | 68391610 | 10000.0 | 1986-04-30 | 3793.0 | -4.000000 | -0.098592 | -0.098592 |
5 | 68391610 | 10000.0 | 1986-05-30 | 3793.0 | -3.109375 | -0.222656 | -0.222656 |
6 | 68391610 | 10000.0 | 1986-06-30 | 3793.0 | -3.093750 | -0.005025 | -0.005025 |
7 | 68391610 | 10000.0 | 1986-07-31 | 3793.0 | -2.843750 | -0.080808 | -0.080808 |
8 | 68391610 | 10000.0 | 1986-08-29 | 3793.0 | -1.093750 | -0.615385 | -0.615385 |
9 | 68391610 | 10000.0 | 1986-09-30 | 3793.0 | -1.031250 | -0.057143 | -0.057143 |
Exploring the Data#
As described here,
Data at WRDS is organized in a hierarchical manner by vendor (e.g. crsp), referred to at the top-level as libraries. Each library contains a number of component tables or datasets (e.g. dsf) which contain the actual data in tabular format, with column headers called variables (such as date, askhi, bidlo, etc).
You can analyze the structure of the data through its metadata using the wrds module, as outlined in the following steps:
List all available libraries at WRDS using list_libraries()
Select a library to work with, and list all available datasets within that library using list_tables()
Select a dataset, and list all available variables (column headers) within that dataset using describe_table()
NOTE: When referencing library and dataset names, you must use all lowercase.
Alternatively, a comprehensive list of all WRDS libraries is available at the Dataset List. This resource provides a listing of each library, their component datasets and variables, as well as a tabular database preview feature, and is helpful in establishing the structure of the data you’re looking for in an easy manner from a Web browser.
Determine the libraries available at WRDS:
sorted(db.list_libraries())[0:10]
['aha_sample',
'ahasamp',
'audit',
'audit_acct_os',
'audit_audit_comp',
'audit_common',
'audit_corp_legal',
'auditsmp',
'auditsmp_all',
'bank']
This will list all libraries available at WRDS in alphabetical order. Though all libraries will be shown, you must have a valid, current subscription for a library in order to access it via Python, just as with SAS or any other supported programming language at WRDS. You will receive an error message indicating this if you attempt to query a table to which your institution does not have access.
To determine the datasets within a given library:
db.list_tables(library="crsp")[0:10]
['acti',
'asia',
'asib',
'asic',
'asio',
'asix',
'bmdebt',
'bmheader',
'bmpaymts',
'bmquotes']
Where ‘library; is a dataset, such as crsp or comp, as returned from step 1 above.
To determine the column headers (variables) within a given dataset:
db.describe_table(library="crsp", table="msf")
Approximately 4922867 rows in crsp.msf.
name | nullable | type | comment | |
---|---|---|---|---|
0 | cusip | True | VARCHAR(8) | None |
1 | permno | True | DOUBLE_PRECISION | None |
2 | permco | True | DOUBLE_PRECISION | None |
3 | issuno | True | DOUBLE_PRECISION | None |
4 | hexcd | True | DOUBLE_PRECISION | None |
5 | hsiccd | True | DOUBLE_PRECISION | None |
6 | date | True | DATE | None |
7 | bidlo | True | DOUBLE_PRECISION | None |
8 | askhi | True | DOUBLE_PRECISION | None |
9 | prc | True | DOUBLE_PRECISION | None |
10 | vol | True | DOUBLE_PRECISION | None |
11 | ret | True | DOUBLE_PRECISION | None |
12 | bid | True | DOUBLE_PRECISION | None |
13 | ask | True | DOUBLE_PRECISION | None |
14 | shrout | True | DOUBLE_PRECISION | None |
15 | cfacpr | True | DOUBLE_PRECISION | None |
16 | cfacshr | True | DOUBLE_PRECISION | None |
17 | altprc | True | DOUBLE_PRECISION | None |
18 | spread | True | DOUBLE_PRECISION | None |
19 | altprcdt | True | DATE | None |
20 | retx | True | DOUBLE_PRECISION | None |
Where ‘library’ is a dataset such as crsp as returned from #1 above and ‘table’ is a component database within that library, such as msf, as returned from query #2 above. Remember that both the library and the dataset are case-sensitive, and must be all-lowercase.
Alternatively, a comprehensive list of all WRDS libraries is available via the WRDS Dataset List. This online resource provides a listing of each library, their component datasets and variables, as well as a tabular database preview feature, and is helpful in establishing the structure of the data you’re looking for in an easy, web-friendly manner.
By examining the metadata available to us – the structure of the data – we’ve determined how to reference the data we’re researching, and what variables are available within that data. We can now perform our actual research, creating data queries, which are explored in depth in the next section.
Querying WRDS Data#
Continue the walkthrough provided here: https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-python/querying-wrds-data-python/
Using get_table
#
# The `obs` keyword argument limits the number of rows returned.
# You can omit it to get all rows.
df = db.get_table(library='crsp', table='msf', columns=['cusip', 'permno', 'date', 'shrout', 'prc', 'ret', 'retx'], obs=10)
df
cusip | permno | date | shrout | prc | ret | retx | |
---|---|---|---|---|---|---|---|
0 | 68391610 | 10000.0 | 1985-12-31 | NaN | NaN | NaN | NaN |
1 | 68391610 | 10000.0 | 1986-01-31 | 3680.0 | -4.375000 | NaN | NaN |
2 | 68391610 | 10000.0 | 1986-02-28 | 3680.0 | -3.250000 | -0.257143 | -0.257143 |
3 | 68391610 | 10000.0 | 1986-03-31 | 3680.0 | -4.437500 | 0.365385 | 0.365385 |
4 | 68391610 | 10000.0 | 1986-04-30 | 3793.0 | -4.000000 | -0.098592 | -0.098592 |
5 | 68391610 | 10000.0 | 1986-05-30 | 3793.0 | -3.109375 | -0.222656 | -0.222656 |
6 | 68391610 | 10000.0 | 1986-06-30 | 3793.0 | -3.093750 | -0.005025 | -0.005025 |
7 | 68391610 | 10000.0 | 1986-07-31 | 3793.0 | -2.843750 | -0.080808 | -0.080808 |
8 | 68391610 | 10000.0 | 1986-08-29 | 3793.0 | -1.093750 | -0.615385 | -0.615385 |
9 | 68391610 | 10000.0 | 1986-09-30 | 3793.0 | -1.031250 | -0.057143 | -0.057143 |
Using raw_sql
#
df = db.raw_sql(
"""
SELECT
cusip, permno, date, shrout, prc, ret, retx
FROM
crsp.msf
LIMIT 10
""",
date_cols=['date']
)
df
cusip | permno | date | shrout | prc | ret | retx | |
---|---|---|---|---|---|---|---|
0 | 68391610 | 10000.0 | 1985-12-31 | NaN | NaN | NaN | NaN |
1 | 68391610 | 10000.0 | 1986-01-31 | 3680.0 | -4.375000 | NaN | NaN |
2 | 68391610 | 10000.0 | 1986-02-28 | 3680.0 | -3.250000 | -0.257143 | -0.257143 |
3 | 68391610 | 10000.0 | 1986-03-31 | 3680.0 | -4.437500 | 0.365385 | 0.365385 |
4 | 68391610 | 10000.0 | 1986-04-30 | 3793.0 | -4.000000 | -0.098592 | -0.098592 |
5 | 68391610 | 10000.0 | 1986-05-30 | 3793.0 | -3.109375 | -0.222656 | -0.222656 |
6 | 68391610 | 10000.0 | 1986-06-30 | 3793.0 | -3.093750 | -0.005025 | -0.005025 |
7 | 68391610 | 10000.0 | 1986-07-31 | 3793.0 | -2.843750 | -0.080808 | -0.080808 |
8 | 68391610 | 10000.0 | 1986-08-29 | 3793.0 | -1.093750 | -0.615385 | -0.615385 |
9 | 68391610 | 10000.0 | 1986-09-30 | 3793.0 | -1.031250 | -0.057143 | -0.057143 |
df = db.raw_sql(
"""
SELECT
a.gvkey, a.datadate, a.tic, a.conm, a.at, a.lt, b.prccm, b.cshoq
FROM
comp.funda a
JOIN
comp.secm b ON a.gvkey = b.gvkey AND a.datadate = b.datadate
WHERE
a.tic = 'IBM' AND
a.datafmt = 'STD' AND
a.consol = 'C' AND
a.indfmt = 'INDL'
LIMIT 10
"""
)
df
gvkey | datadate | tic | conm | at | lt | prccm | cshoq | |
---|---|---|---|---|---|---|---|---|
0 | 006066 | 1962-12-31 | IBM | INTL BUSINESS MACHINES CORP | 2112.301 | 731.700 | 389.9996 | NaN |
1 | 006066 | 1963-12-31 | IBM | INTL BUSINESS MACHINES CORP | 2373.857 | 782.119 | 506.9994 | NaN |
2 | 006066 | 1964-12-31 | IBM | INTL BUSINESS MACHINES CORP | 3309.152 | 1055.072 | 409.4995 | NaN |
3 | 006066 | 1965-12-31 | IBM | INTL BUSINESS MACHINES CORP | 3744.917 | 1166.771 | 498.9991 | NaN |
4 | 006066 | 1966-12-31 | IBM | INTL BUSINESS MACHINES CORP | 4660.777 | 1338.149 | 371.4997 | NaN |
5 | 006066 | 1967-12-31 | IBM | INTL BUSINESS MACHINES CORP | 5598.668 | 1767.067 | 626.9995 | NaN |
6 | 006066 | 1968-12-31 | IBM | INTL BUSINESS MACHINES CORP | 6743.430 | 2174.291 | 314.9998 | NaN |
7 | 006066 | 1969-12-31 | IBM | INTL BUSINESS MACHINES CORP | 7389.957 | 2112.967 | 364.4994 | 113.717 |
8 | 006066 | 1970-12-31 | IBM | INTL BUSINESS MACHINES CORP | 8539.047 | 2591.909 | 317.7496 | 114.587 |
9 | 006066 | 1971-12-31 | IBM | INTL BUSINESS MACHINES CORP | 9576.219 | 2933.837 | 336.4996 | 115.534 |
params = {"tickers": ("0015B", "0030B", "0032A", "0033A", "0038A")}
df = db.raw_sql(
"""
SELECT
datadate, gvkey, cusip
FROM comp.funda
WHERE
tic IN %(tickers)s
LIMIT 10
""",
params=params,
)
df
datadate | gvkey | cusip | |
---|---|---|---|
0 | 1982-10-31 | 002484 | 121579932 |
1 | 1983-10-31 | 002484 | 121579932 |
2 | 1984-10-31 | 002484 | 121579932 |
3 | 1985-10-31 | 002484 | 121579932 |
4 | 1986-10-31 | 002484 | 121579932 |
5 | 1987-10-31 | 002484 | 121579932 |
6 | 1988-10-31 | 002484 | 121579932 |
7 | 1989-06-30 | 002484 | 121579932 |
8 | 1990-06-30 | 002484 | 121579932 |
9 | 1991-06-30 | 002484 | 121579932 |
Misc#
data = db.get_row_count('djones', 'djdaily')
data
28073
db.close()