HW Guide: Wage Growth During the Recession#
This note is designed to help you understand the full scope of the task. The homework will have you fill in the code to implement some of the functions used here. This notebook puts all of those functions together to visualize the final result. If you want, you can first follow the steps of this notebook to help you write the code, and then refactor the code by moving it into separate files and functions that are covered by the unit tests. The point of this document is to guide you through the homework.
Before you start this assignment, you'll have to install the weightedstats package. You can do this by running pip install weightedstats in a terminal. For more information, see here: https://pypi.org/project/weightedstats/ . We will use this package to compute a weighted median. The package is included in the requirements.txt for the HW repo, so you can just use that if you'd like.
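Once installed, you can verify it works from a Python prompt. A weighted median treats each observation as if it were repeated in proportion to its weight, so, for example:
weightedstats.weighted_median([10, 20, 30], weights=[1, 1, 5])  # returns 30, since 30 carries more than half of the total weight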
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import weightedstats
import seaborn as sns
sns.set()
import config
from pathlib import Path
import load_fred
DATA_DIR = Path(config.DATA_DIR)
import load_cps
import wage_growth_analytics
# Change default pandas display options
pd.options.display.max_columns = 25
pd.options.display.max_rows = 500
pd.options.display.max_colwidth = 100
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Change default figure size
plt.rcParams['figure.figsize'] = 6, 5
1. Introduction#
Wages at the aggregate level have been found to rise during the recent recession (or at least not to decline). This is surprising given that aggregate labor market conditions were so weak, and many economists have viewed it as a puzzle. The goal in this question is to assess whether this really is a puzzle, given that the composition of workers is changing over time. We begin by establishing the facts from which this puzzle arises.
The great recession is generally defined as having lasted from December 2007 to June 2009. For example, see the corresponding article on Wikipedia: https://en.wikipedia.org/wiki/Great_Recession_in_the_United_States
The Great Recession in the United States was a severe financial crisis combined with a deep recession. While the recession officially lasted from December 2007 to June 2009, it took several years for the economy to recover to pre-crisis levels of employment and output.
Consider the following graph. This graph plots the median weekly real earnings for wage and salary workers ages 16 years and older. The shaded areas in the plot denote official periods of recession, as defined by the NBER. Notice that wages appear to be increasing during the period of the Great Recession. In this document we will be investigating this fact.
Step 1. Retrieve and plot this data from FRED#
Find the data for the FRED plot shown above. You can find it using Google and the title.
Also, find the data for the same figure, except restricted to “men”. This is called “Employed full time: Median usual weekly real earnings: Wage and salary workers: 16 years and over: Men (LES1252881900Q)”. To eliminate other potentially confounding factors, we will try to narrow the demographic group we are studying.
Make a simple plot of both of these data series using Pandas’ built-in plot methods. Restrict the time period to show from 2000 to 2017.
No need to make the shaded recession bars in your plot. I'm just looking for the simplest, easiest plot possible. Just be sure it's labelled (e.g., use plt.title('Median ...')).
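If you pull the series programmatically rather than downloading CSVs, here is one possible sketch of what load_fred.pull_all could do, assuming the pandas_datareader package is installed (this is an illustration of the idea, not necessarily how the provided module is written):
import pandas_datareader.data as web

def pull_all(start='2000-01-01', end=None):
    # Pull both FRED series into a single DataFrame indexed by date
    series = ['LES1252881600Q', 'LES1252881900Q']
    return web.DataReader(series, 'fred', start=start, end=end)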
df_fred = load_fred.pull_all()
load_fred.series_descriptions
{'LES1252881600Q': 'Employed full time: Median usual weekly real earnings: Wage and salary workers: 16 years and over',
'LES1252881900Q': 'Employed full time: Median usual weekly real earnings: Wage and salary workers: 16 years and over: Men'}
df_fred = df_fred.rename(columns={'LES1252881600Q':'Median wages (all)'})
df_fred = df_fred.rename(columns={'LES1252881900Q':'Median wages (men)'})
df_fred['2000':'2017'].plot()
plt.title('Median usual weekly real earnings');

df_fred.plot()
plt.title('Median usual weekly real earnings');

2. Demographic data from US Current Population Survey#
To study this puzzling fact, we will analyze how changing demographics might affect our aggregate measures of wages. One of our biggest concerns is that as employment rates fell, they fell more for low educated workers than high educated workers. That means the average wages we are measuring will actually be comparing different types of workers in 2010 relative to 2006. This could potentially explain why wages are rising. We will attempt to compensate for this by constructing a demographically adjusted time series for wages.
Game Plan#
Here is a summary of what we will do. I will give you more specific instructions afterwards to walk you through this.
Use data from the 2000-2023 March CPS (downloaded from IPUMS CPS web site) to examine the time series (annual) trend in both nominal and real wages for 25-54 year old men not living in group quarters. (You need to download the group quarters variable). Wages will be defined as annual earnings (last year) divided by annual hours (last year). Annual hours can be computed by multiplying weeks worked last year by usual hours worked (last year). You should do this for a sample of all workers with a positive wage last year. Typically, you can convert to real wages using any deflator you wish. Here, I will require using the CPI99 variable to convert to 1999 dollars. (Note: Many economists like to use the June CPI-U. June is in the middle of the year and, for that reason, avoids some issues with seasonality. The choice of year doesn’t matter.)
Note about ASEC#
Be sure to download only the ASEC samples. The ASEC supplement is the March CPS. The ASEC includes extra variables that are not included in the basic monthly CPS survey. From Wikipedia, the ASEC is described as follows:
Since 1948, the CPS has included supplemental questions (at first, in April; later, in March) on income received in the previous calendar year, which are used to estimate the data on income and work experience. These data are the source of the annual Census Bureau report on income, poverty, and health insurance coverage…
Downloading the March samples from the “Basic Monthly” samples doesn’t give us what we need. See the included screenshot. The arrow points out the fact that these variables are missing in the March basic monthly samples but are available in the ASEC samples.
Step 2. Download and Read the CPS Data from IPUMS#
When you download the data from IPUMS, only download the data that you need. This is important, because the data gets very big very fast when you start adding other variables. IPUMS will automatically add certain variables to your “cart”. Go ahead and remove the ones that you don't need; this will make things much easier. The variables that you need are listed below. Make sure to stick to just those variables. You will fill in the code in the file called load_cps.py that will clean the raw IPUMS CPS data. This file assumes that you have downloaded the data manually from the IPUMS CPS website as described above and saved it in the directory ../data/manual relative to the source file. Relative to the project home directory, this is ./data/manual . Recall that we are putting it in the manual directory instead of the pulled directory, because the data was not automatically pulled from the source.
The data should contain the following variables:
YEAR (Survey year)
GQ (Group Quarters status)
CPI99 (CPI-U adjustment factor to 1999 dollars)
ASECWT (Annual Social and Economic Supplement Weight)
AGE (Age)
SEX (Sex)
LABFORCE (Labor force status)
EDUC (Educational attainment recode)
WKSWORK1 (Weeks worked last year)
UHRSWORKLY (Usual hours worked per week (last yr))
INCWAGE (Wage and salary income)
When you’re finished, run the following code to see how much memory your data set is using, the list of variables, and the data types each variable is saved as:
DATA_DIR = config.DATA_DIR
MANUAL_DATA_DIR = config.MANUAL_DATA_DIR
df = load_cps.load_raw(MANUAL_DATA_DIR)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4752652 entries, 0 to 4752651
Data columns (total 12 columns):
# Column Dtype
--- ------ -----
0 YEAR datetime64[ns]
1 GQ category
2 CPI99 float64
3 CPSIDV int64
4 ASECWT float64
5 AGE int64
6 SEX category
7 LABFORCE category
8 EDUC category
9 WKSWORK1 int64
10 UHRSWORKLY int64
11 INCWAGE int64
dtypes: category(4), datetime64[ns](1), float64(2), int64(5)
memory usage: 308.2 MB
As you can see, this results in a data set of roughly 4.75 million rows that uses about 308 MB of memory.
If you restrict your data set to exactly these variables and follow the instructions precisely, you should be able to replicate the exact numbers that I get. (For example, don't drop missing data until the instructions tell you to.)
df.describe()
 | YEAR | CPI99 | CPSIDV | ASECWT | AGE | WKSWORK1 | UHRSWORKLY | INCWAGE
---|---|---|---|---|---|---|---|---
count | 4752652 | 4752652.000 | 4752652.000 | 4752652.000 | 4752652.000 | 4752652.000 | 4752652.000 | 4752652.000 |
mean | 2011-05-27 05:38:11.579523328 | 0.772 | 130839243535879.688 | 1683.323 | 35.410 | 23.563 | 511.442 | 23325037.674 |
min | 2000-01-01 00:00:00 | 0.547 | 0.000 | 21.090 | 0.000 | 0.000 | 1.000 | 0.000 |
25% | 2005-01-01 00:00:00 | 0.694 | 0.000 | 817.808 | 16.000 | 0.000 | 40.000 | 0.000 |
50% | 2011-01-01 00:00:00 | 0.764 | 200412011918016.000 | 1506.560 | 35.000 | 6.000 | 70.000 | 28000.000 |
75% | 2017-01-01 00:00:00 | 0.882 | 201303039922031.000 | 2211.361 | 52.000 | 52.000 | 999.000 | 142000.000 |
max | 2024-01-01 00:00:00 | 1.000 | 202403069322011.000 | 44423.830 | 90.000 | 52.000 | 999.000 | 99999999.000 |
std | NaN | 0.117 | 95862336975917.375 | 1224.023 | 22.360 | 24.782 | 480.190 | 42264633.688 |
When filling in the missing values, keep in mind the codings for each variable. Consult the associated codebook and IPUMS documentation for complete information. (I've included the codebook from my data extract in the file named cps_00007_codebook.pdf.)
For your convenience, I’ve reproduced some information about some of our variables below.
UHRSWORKLY#
For UHRSWORKLY: Usual hours worked per week (last yr) https://cps.ipums.org/cps-action/variables/UHRSWORKLY#codes_section
UHRSWORKLY is a 2-digit numeric value.
99 = 99 hours or more
999 = NIU (Not in universe)
WKSWORK1#
WKSWORK1 reports the number of weeks, in single weeks, that the respondent worked for profit, pay, or as an unpaid family worker during the preceding calendar year. Respondents were prompted to count weeks in which they worked for even a few hours and to include paid vacation and sick leave as work. Information on weeks worked during the preceding year is available in the form of intervals for 1962 forward in the WKSWORK2 variable.
INCWAGE#
For INCWAGE: Wage and salary income https://cps.ipums.org/cps-action/variables/INCWAGE#description_section
Codes
99999999 = N.I.U. (Not in Universe).
99999998 = Missing.
INCWAGE indicates each respondent’s total pre-tax wage and salary income–that is, money received as an employee–for the previous calendar year. Amounts are expressed as they were reported to the interviewer; users must adjust for inflation using Consumer Price Index adjustment factors.
For ASEC samples 1988-onward, INCWAGE is derived from a Census recode variable. The topcoded components of INCWAGE are OINCWAGE and INCLONGJ. OINCWAGE is always a component of INCWAGE. When SRCEARN indicates that INCLONGJ is earned from wage and salary, INCLONGJ is an additional component of INCWAGE.
CPI99#
For CPI99: CPI99 is a 4-digit variable with three implied decimal places. https://cps.ipums.org/cps-action/variables/CPI99#description_section
CPI99 provides the CPI-U multiplier (available from the Bureau of Labor Statistics) to convert dollar figures to constant 1999 dollars. (This corresponds to the dollar amounts in the 2000 CPS, which inquired about income in 1999.) Multiplying dollar amounts by this variable (which is constant within years) will render them comparable across time and thus suitable for multivariate analysis.
For more information on how to use CPI99, please see the IPUMS inflation adjustment page.
LABFORCE#
https://cps.ipums.org/cps-action/variables/LABFORCE#codes_section
Codes:
0 NIU
1 No, not in the labor force
2 Yes, in the labor force
LABFORCE is a dichotomous variable indicating whether the respondent participated in the labor force during the preceding week. See EMPSTAT for a more detailed employment status variable. Those coded as “yes” in LABFORCE either were at work; held a job but were temporarily absent from work due to factors like vacation or illness; were seeking work; or were temporarily laid off from a job during the reference period.
Because the CPS is designed to measure unemployment in the civilian population, the original dichotomous employment status variable in the survey classifies members of the armed forces as NIU (Not in universe).
Step 3. Recode the missing data#
Each variable in the CPS data that we downloaded has specific codes used to represent missing data. There is also a difference between data that is missing and data that is considered “NIU”, meaning “Not in Universe.” Consult the documentation for the meaning of each of the codes for each of the variables. Convert all missing values and NIU values to np.nan.
For example, you should include the following code for INCWAGE:
# INCWAGE: Missing values
# 99999999 = N.I.U. (Not in Universe).
# 99999998 = Missing.
df.loc[df['INCWAGE'] == 99999999, 'INCWAGE'] = np.nan
df.loc[df['INCWAGE'] == 99999998, 'INCWAGE'] = np.nan
Be sure to look at the encodings on IPUMS to make sure you’re properly coding all variables in your dataset—not just the ones that I listed above (not necessarily comprehensive).
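For instance, using the UHRSWORKLY codes reproduced above, the analogous recode would be (a sketch following the same pattern as the INCWAGE example):
# UHRSWORKLY: Missing values
# 999 = NIU (Not in universe). Note that 99 ("99 hours or more") is a valid value, not a missing code.
df.loc[df['UHRSWORKLY'] == 999, 'UHRSWORKLY'] = np.nan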
df = load_cps.load_clean(MANUAL_DATA_DIR)
df.describe()
 | YEAR | CPI99 | CPSIDV | ASECWT | AGE | WKSWORK1 | UHRSWORKLY | INCWAGE
---|---|---|---|---|---|---|---|---
count | 4752652 | 4752652.000 | 4752652.000 | 4752652.000 | 4752652.000 | 4752652.000 | 2412892.000 | 3645116.000 |
mean | 2011-05-27 05:38:11.579523328 | 0.772 | 130839243535879.688 | 1683.323 | 35.410 | 23.563 | 38.661 | 28034.241 |
min | 2000-01-01 00:00:00 | 0.547 | 0.000 | 21.090 | 0.000 | 0.000 | 1.000 | 0.000 |
25% | 2005-01-01 00:00:00 | 0.694 | 0.000 | 817.808 | 16.000 | 0.000 | 36.000 | 0.000 |
50% | 2011-01-01 00:00:00 | 0.764 | 200412011918016.000 | 1506.560 | 35.000 | 6.000 | 40.000 | 12000.000 |
75% | 2017-01-01 00:00:00 | 0.882 | 201303039922031.000 | 2211.361 | 52.000 | 52.000 | 40.000 | 40000.000 |
max | 2024-01-01 00:00:00 | 1.000 | 202403069322011.000 | 44423.830 | 90.000 | 52.000 | 99.000 | 2099999.000 |
std | NaN | 0.117 | 95862336975917.375 | 1224.023 | 22.360 | 24.782 | 12.122 | 51670.193 |
Step 4. Select the desired subsample of the data#
As we mentioned, we want to narrow the demographic group under consideration. Drop all observations from df that don't match the following conditions.
Use data from the 2000-2023 March CPS for 25-54 year old men not living in group quarters. Only consider individuals with a strictly positive income from wages.
For example, your code might start like this:
# GQ = 0 for vacant units, 1 for Households, 2 for group quarters
df = df[df['GQ'] == 1]
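A rough sketch of the remaining restrictions might look like the following. This is my reading of the instructions; in particular, it assumes SEX still carries the numeric IPUMS codes (1 = male), so adjust it if your load_cps step relabels the categories.
df = df[(df['AGE'] >= 25) & (df['AGE'] <= 54)]  # 25-54 year olds
df = df[df['SEX'] == 1]                         # men (IPUMS code 1 = male)
df = df[df['INCWAGE'] > 0]                      # strictly positive wage income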
df = wage_growth_analytics.s04_subsample(df)
df.describe()
 | YEAR | CPI99 | CPSIDV | ASECWT | AGE | WKSWORK1 | UHRSWORKLY | INCWAGE
---|---|---|---|---|---|---|---|---
count | 792456 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 |
mean | 2011-02-01 15:06:39.915200512 | 0.777 | 129026943321680.312 | 1699.807 | 39.518 | 49.204 | 43.340 | 59718.730 |
min | 2000-01-01 00:00:00 | 0.547 | 0.000 | 25.330 | 25.000 | 1.000 | 1.000 | 1.000 |
25% | 2005-01-01 00:00:00 | 0.694 | 0.000 | 825.017 | 33.000 | 52.000 | 40.000 | 27000.000 |
50% | 2011-01-01 00:00:00 | 0.764 | 200401045625016.000 | 1516.550 | 40.000 | 52.000 | 40.000 | 45000.000 |
75% | 2017-01-01 00:00:00 | 0.882 | 201212064455521.000 | 2209.660 | 46.000 | 52.000 | 47.000 | 70000.000 |
max | 2024-01-01 00:00:00 | 1.000 | 202403069322011.000 | 44423.830 | 54.000 | 52.000 | 99.000 | 2099999.000 |
std | NaN | 0.118 | 96391382223115.375 | 1257.945 | 8.325 | 8.436 | 9.618 | 68994.539 |
Step 5. Construct new variables#
Now, construct new variables called real_incwage, annual_hours, real_wage, and in_labor_force.
real_incwage is the CPI99 deflator times annual earnings (last year).
annual_hours is weeks worked last year times usual hours worked per week (last year).
real_wage is real annual earnings (last year) divided by annual hours (last year). Be sure to set real_wage equal to zero whenever annual_hours is equal to zero. I use the following code:
# To prevent infinite wages
df.loc[df['annual_hours'] <= 0, 'real_wage'] = 0
in_labor_force is a boolean variable that is True if the individual was coded as “Yes, in the labor force.” Use the LABFORCE variable for this.
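Putting those definitions together, a minimal sketch of s05_new_vars could look like this (the division uses the deflated earnings, consistent with the summary statistics shown below, and the LABFORCE comparison assumes the numeric IPUMS codes are preserved; check your version against the unit tests):
df['real_incwage'] = df['CPI99'] * df['INCWAGE']
df['annual_hours'] = df['WKSWORK1'] * df['UHRSWORKLY']
df['real_wage'] = df['real_incwage'] / df['annual_hours']
# To prevent infinite wages
df.loc[df['annual_hours'] <= 0, 'real_wage'] = 0
df['in_labor_force'] = df['LABFORCE'] == 2  # IPUMS code 2 = "Yes, in the labor force"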
df = wage_growth_analytics.s05_new_vars(df)
df.describe()
 | YEAR | CPI99 | CPSIDV | ASECWT | AGE | WKSWORK1 | UHRSWORKLY | INCWAGE | real_incwage | annual_hours | real_wage
---|---|---|---|---|---|---|---|---|---|---|---
count | 792456 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 |
mean | 2011-02-01 15:06:39.915200512 | 0.777 | 129026943321680.312 | 1699.807 | 39.518 | 49.204 | 43.340 | 59718.730 | 45214.235 | 2145.548 | 22.022 |
min | 2000-01-01 00:00:00 | 0.547 | 0.000 | 25.330 | 25.000 | 1.000 | 1.000 | 1.000 | 0.726 | 1.000 | 0.000 |
25% | 2005-01-01 00:00:00 | 0.694 | 0.000 | 825.017 | 33.000 | 52.000 | 40.000 | 27000.000 | 20979.000 | 2080.000 | 10.414 |
50% | 2011-01-01 00:00:00 | 0.764 | 200401045625016.000 | 1516.550 | 40.000 | 52.000 | 40.000 | 45000.000 | 34120.000 | 2080.000 | 15.779 |
75% | 2017-01-01 00:00:00 | 0.882 | 201212064455521.000 | 2209.660 | 46.000 | 52.000 | 47.000 | 70000.000 | 53637.000 | 2340.000 | 24.228 |
max | 2024-01-01 00:00:00 | 1.000 | 202403069322011.000 | 44423.830 | 54.000 | 52.000 | 99.000 | 2099999.000 | 1352399.356 | 5148.000 | 312342.282 |
std | NaN | 0.118 | 96391382223115.375 | 1257.945 | 8.325 | 8.436 | 9.618 | 68994.539 | 50216.603 | 606.098 | 364.976 |
Step 6. Drop data that we no longer need.#
Run the code below to drop data that we don’t need.
# Drop the variables that we no longer need.
# Your data might not have some of these variables.
labels = ['GQ', 'SEX', 'LABFORCE', 'CPI99', 'INCTOT', 'WKSWORK1', ]
for label in labels:
    try:
        df.drop(labels=label, inplace=True, axis=1)
    except KeyError:
        pass
Also, after you do this, you should drop all rows that contain a missing value.
df = wage_growth_analytics.s06_drop(df)
Step 7. Generate summary statistics#
Use describe to generate summary statistics for at least the following variables: AGE, INCWAGE, real_incwage, annual_hours, real_wage. (For now, disregard the fact that we need to use the survey weights to compute these figures.) What do you notice about the top end of the distribution of real_wage?
At this point, your answer should look like this:
# There seem to be some very large wages at the top end of `real_wage`
df.describe()
 | YEAR | CPSIDV | ASECWT | AGE | UHRSWORKLY | INCWAGE | real_incwage | annual_hours | real_wage
---|---|---|---|---|---|---|---|---|---
count | 792456 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 | 792456.000 |
mean | 2011-02-01 15:06:39.915200512 | 129026943321680.312 | 1699.807 | 39.518 | 43.340 | 59718.730 | 45214.235 | 2145.548 | 22.022 |
min | 2000-01-01 00:00:00 | 0.000 | 25.330 | 25.000 | 1.000 | 1.000 | 0.726 | 1.000 | 0.000 |
25% | 2005-01-01 00:00:00 | 0.000 | 825.017 | 33.000 | 40.000 | 27000.000 | 20979.000 | 2080.000 | 10.414 |
50% | 2011-01-01 00:00:00 | 200401045625016.000 | 1516.550 | 40.000 | 40.000 | 45000.000 | 34120.000 | 2080.000 | 15.779 |
75% | 2017-01-01 00:00:00 | 201212064455521.000 | 2209.660 | 46.000 | 47.000 | 70000.000 | 53637.000 | 2340.000 | 24.228 |
max | 2024-01-01 00:00:00 | 202403069322011.000 | 44423.830 | 54.000 | 99.000 | 2099999.000 | 1352399.356 | 5148.000 | 312342.282 |
std | NaN | 96391382223115.375 | 1257.945 | 8.325 | 9.618 | 68994.539 | 50216.603 | 606.098 | 364.976 |
df.query('real_wage > 300000')
 | YEAR | CPSIDV | ASECWT | AGE | EDUC | UHRSWORKLY | INCWAGE | real_incwage | annual_hours | real_wage | in_labor_force
---|---|---|---|---|---|---|---|---|---|---|---
1944649 | 2009-01-01 | 200802018016011 | 3381.190 | 48 | 73 | 1.000 | 403543.000 | 312342.282 | 1.000 | 312342.282 | True |
Step 8. Plot a histogram of usual weekly hours and of annual hours worked.#
That is, plot a histogram of UHRSWORKLY and a histogram of annual_hours.
df[['UHRSWORKLY']].plot.hist();

df['annual_hours'].plot.hist();

Step 9. Plot a histogram of real wages.#
Plot a histogram of real_wage using the default number of bins in pandas.
Plot a histogram of real_wage for only those with real_wage less than 200. Use the default number of bins.
Plot a histogram of real_wage for only those with real_wage less than 200. Use 50 bins.
df['real_wage'].hist();

df.loc[df['real_wage'] < 200,'real_wage'].hist();

df.loc[df['real_wage'] < 200,'real_wage'].hist(bins=50)
plt.title('Wages in sample');

Step 10. Drop all rows that have wages above the 99th percentile or below the 1st percentile.#
For example, use the code q01 = df.real_wage.quantile(q=0.01) to compute the 1st percentile. Then use df = df[YOUR_MASK_HERE] or df = df.query(YOUR_CODE_HERE).
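For reference, a minimal sketch of this step is below (my reading of the instructions; whether the cutoffs are treated as inclusive or strict should be checked against the unit tests):
q01 = df['real_wage'].quantile(q=0.01)
q99 = df['real_wage'].quantile(q=0.99)
df = df[(df['real_wage'] >= q01) & (df['real_wage'] <= q99)]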
df = wage_growth_analytics.s10_drop_by_percentiles(df)
df.describe()
 | YEAR | CPSIDV | ASECWT | AGE | UHRSWORKLY | INCWAGE | real_incwage | annual_hours | real_wage
---|---|---|---|---|---|---|---|---|---
count | 776592 | 776592.000 | 776592.000 | 776592.000 | 776592.000 | 776592.000 | 776592.000 | 776592.000 | 776592.000 |
mean | 2011-02-05 00:10:56.072686848 | 129074592829024.172 | 1700.176 | 39.494 | 43.366 | 56126.002 | 42509.740 | 2149.628 | 19.461 |
min | 2000-01-01 00:00:00 | 0.000 | 25.330 | 25.000 | 1.000 | 8.000 | 6.432 | 1.000 | 2.383 |
25% | 2005-01-01 00:00:00 | 0.000 | 825.170 | 32.000 | 40.000 | 27000.000 | 21252.000 | 2080.000 | 10.494 |
50% | 2011-01-01 00:00:00 | 200401049872011.000 | 1516.385 | 40.000 | 40.000 | 45000.000 | 34120.000 | 2080.000 | 15.779 |
75% | 2017-01-01 00:00:00 | 201212070162021.000 | 2210.150 | 46.000 | 46.000 | 70000.000 | 53091.000 | 2340.000 | 23.986 |
max | 2024-01-01 00:00:00 | 202403069322011.000 | 44423.830 | 54.000 | 99.000 | 850000.000 | 554200.000 | 5148.000 | 116.763 |
std | NaN | 96377910514422.531 | 1258.375 | 8.322 | 9.485 | 47781.583 | 35118.563 | 596.974 | 14.015 |
df['real_wage'].hist(bins=50)
<Axes: >

3. Recreate a Wage Series Like The One From FRED#
Step 11. Create time series for wages and employment#
These first series that we create will not be demographically adjusted. Create these series with the following variable names: median_wages, ave_wages, and employment. Each of these should be computed using the survey weights. Notice that np.average has a keyword argument for weights. To construct a weighted median, we will use the weightedstats package.
Here is the code to compute the weighted median. Fill in the rest of the code for the other variables.
col = 'real_wage'
weights='ASECWT'
median_wages = (
df
.dropna(subset=[col], how='any')
.groupby('YEAR')
.apply(lambda row: weightedstats.weighted_median(row[col], weights=row[weights]))
# Shift time back since the series represents wages from the previous year
.shift(-1)
)
NOTE: Wage data needs to be shifted backwards, but employment data does not need to be.
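For the other two series, a sketch following the same pattern might look like this. The weighted mean uses the weights argument of np.average; the employment definition here (the weighted share of the sample coded as in the labor force, not shifted) is my assumption about what s11_employment computes, so verify it against the unit tests.
ave_wages = (
    df
    .dropna(subset=['real_wage'], how='any')
    .groupby('YEAR')
    .apply(lambda g: np.average(g['real_wage'], weights=g['ASECWT']))
    .shift(-1)  # wages refer to the previous calendar year
)

employment = (
    df
    .groupby('YEAR')
    .apply(lambda g: np.average(g['in_labor_force'], weights=g['ASECWT']))
    # No shift: labor force status refers to the survey reference week
)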
median_wages = wage_growth_analytics.s11_median_wages(df)
ave_wages = wage_growth_analytics.s11_ave_wages(df)
employment = wage_growth_analytics.s11_employment(df)
Step 12. Concatenate these time series variables together.#
Use pd.concat to create a new DataFrame called tdf that contains median_wages, ave_wages, and employment. Make sure that the column names are preserved in tdf.
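For example, passing a dict to pd.concat is one way to keep the names (the keys become the column names):
tdf = pd.concat(
    {'ave_wages': ave_wages, 'median_wages': median_wages, 'employment': employment},
    axis=1,
)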
Running tdf.head() should produce the following result at this point.
tdf = wage_growth_analytics.s12_time_series(df)
tdf.head()
YEAR | ave_wages | median_wages | employment
---|---|---|---
2000-01-01 | 18.932 | 15.807 | 0.966 |
2001-01-01 | 19.542 | 15.834 | 0.965 |
2002-01-01 | 19.179 | 15.582 | 0.966 |
2003-01-01 | 18.915 | 15.663 | 0.964 |
2004-01-01 | 18.743 | 15.508 | 0.964 |
Step 13. Plot the series#
Plot ave_wages and median_wages on the same plot. Make another plot for employment.
tdf[['ave_wages', 'median_wages']].plot(title='Wages over time')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5));

employment.plot(title='Employment')
<Axes: title={'center': 'Employment'}, xlabel='YEAR'>

Step 14. Compute the autocorrelation of each variable in tdf#
HINT: You can compute the autocorrelation of a single series using, e.g., tdf.ave_wages.autocorr(). You can do this for each series in tdf in one go by using tdf.apply(lambda x: YOUR_CODE_HERE).
#autocorrelation
tdf.apply(lambda x: x.autocorr())
ave_wages 0.906
median_wages 0.797
employment 0.177
dtype: float64
Step 15. Compute the contemporaneous correlation between the variables in tdf.#
This amounts to simply calculating the correlation matrix.
tdf.corr()
 | ave_wages | median_wages | employment
---|---|---|---
ave_wages | 1.000 | 0.844 | -0.193 |
median_wages | 0.844 | 1.000 | -0.180 |
employment | -0.193 | -0.180 | 1.000 |
Step 16. Plot median wages and employment on the same graph#
Use the following starter code to plot these two series on the same graph, with separate y-axes (a left y-axis and a right y-axis).
ax = tdf[['median_wages']].plot(legend=False, color='red', label='Wght. Median Wages')
ax.grid(False)
ax.set_ylabel('Median Wages')
ax2 = ax.twinx()
ax2.grid(False)
ax2.set_ylabel('Employment')
tdf.employment.plot(ax=ax2)
ax.figure.legend()
<matplotlib.legend.Legend at 0x177e67c50>

Step 17. Plot median wages growth and employment on the same graph#
First, create new variables in tdf called ave_wage_growth and median_wage_growth. Apply the method pct_change() to the appropriate columns to create the new variables. This will compute the percent changes in each variable over time. Use the same code as from before to make the plot, but substitute in your new variable. Be sure to update the labels on the axes.
Note: When computing various statistics, for example correlations, we want to beware of variables that are steadily growing or shrinking over time. Variables like this are said to be non-stationary, and using them without care can often lead to spurious correlations. In time series econometrics, there are various ways of transforming these variables into stationary ones. One method, which we'll use here, is to compute percentage differences. Since we believe wages tend to increase over time, we want to difference (in our case, compute percentage changes of) the wage variable. Our employment variable, however, is a ratio between zero and one that is not steadily growing or shrinking, so it is fine to use it in levels.
tdf['ave_wage_growth'] = tdf['ave_wages'].pct_change(fill_method=None)
tdf['median_wage_growth'] = tdf['median_wages'].pct_change(fill_method=None)
ax = tdf['median_wage_growth'].plot(legend=False, color='red')
ax.grid(False)
ax.set_ylabel('Median Wages Growth')
ax2 = ax.twinx()
ax2.grid(False)
ax2.set_ylabel('Employment')
tdf['employment'].plot(ax=ax2)
ax.figure.legend();

## Extra:
# Note that it appears that employment and wage growth
# are slightly out of phase. This could be because the two
# are negatively correlated. Or maybe it's because we shouldn't
# have shifted wages back---even though it seemed like the appropriate
# thing to do given the definition of wages. We should probably
# leave it as it is, though.
print(tdf[['median_wage_growth']].corrwith(tdf.employment))
print(tdf[['median_wage_growth']].shift(1).corrwith(tdf.employment))
median_wage_growth -0.039
dtype: float64
median_wage_growth -0.144
dtype: float64
Step 18. Plot median wages growth and employment growth on the same graph#
Create a new variable in tdf called employment_growth. Again use .pct_change() to create it.
Use the same code as from before. However, this time let’s explore the relationship of wages with employment growth rather than the employment level. Be sure to update the labels of the axes so that we know what we’re looking at.
tdf['employment_growth'] = tdf['employment'].pct_change(fill_method=None)
ax = tdf['median_wage_growth'].plot(legend=False, color='red')
ax.grid(False)
ax.set_ylabel('Median wage growth')
ax2 = ax.twinx()
ax2.grid(False)
ax2.set_ylabel('Employment growth')
tdf['employment_growth'].plot(ax=ax2)
ax.figure.legend()
<matplotlib.legend.Legend at 0x34fe7b890>

Step 19. Use the code below as well as the graphs from before to describe what we’re seeing in the data.#
Use .corr() to compute the correlations between the various wage growth series and the measures of employment in tdf. Give a summary of what you're seeing in the data. How does it match up with your intuition and with your understanding of supply and demand in labor markets?
tdf.corr()
 | ave_wages | median_wages | employment | ave_wage_growth | median_wage_growth | employment_growth
---|---|---|---|---|---|---
ave_wages | 1.000 | 0.844 | -0.193 | 0.323 | 0.190 | -0.102 |
median_wages | 0.844 | 1.000 | -0.180 | 0.231 | 0.322 | -0.142 |
employment | -0.193 | -0.180 | 1.000 | 0.110 | -0.039 | 0.639 |
ave_wage_growth | 0.323 | 0.231 | 0.110 | 1.000 | 0.596 | -0.287 |
median_wage_growth | 0.190 | 0.322 | -0.039 | 0.596 | 1.000 | -0.260 |
employment_growth | -0.102 | -0.142 | 0.639 | -0.287 | -0.260 | 1.000 |
Solution:
This is an open-ended question. As long as you provide a reasonable response, that should be fine. I see that average wage growth is slightly positively correlated with the level of employment, though the correlation is small enough that I might call the two uncorrelated. Median wage growth appears to be slightly negatively correlated with employment. My intuition, coming from my understanding of supply and demand in labor markets, is that wage growth should be higher when employment is higher, so the two should be positively correlated. That doesn't appear to be the case. We see a similar pattern when we compare wage growth to employment growth, where the same intuition should apply.
4. Create Demographically Adjusted Series#
Here we will demographically adjust the data to account for composition changes over time (based on observables). As employment rates fell, they fell more for low-educated workers than for high-educated workers. That means the average wages you measured previously will actually be comparing different types of workers in 2010 relative to 2006.
To demographically adjust the data, create age-skill groups in each year. Use 5-year age ranges (25-29, 30-34, etc.) and 5 education groupings (less than high school, high school only, some college, bachelor's degree, more than a bachelor's degree). Compute wages within these cells for each year (again weighting the data using survey weights). However, when computing time series trends, we will fix the population weights for each cell at year 2000 levels (i.e., the share of the population that lives in each group). So, for each year, you will compute a measure of wages by multiplying the means within each cell during that year by the 2000 shares of population in each cell (and then summing across cells within the year). By doing this, you will be able to compute a demographically adjusted wage series during the 2000 – 2022 period (we have data up to 2023, but the wage variables refer to the wages earned in the previous year). We will discuss how these demographically adjusted wage series compare to your series from before.
Step 20. Create new variables in df for age and education bins.#
To demographically adjust the data, create age-skill groups in each year. Use 5 year age ranges (25-29, 30-34, etc.) and 5 education groupings (less than high school, high school only, some college, bachelor’s degree, more than a bachelor’s degree).
I have provided starter code here:
# Add bins to `df` for the AGE and EDUC variables as described in
# Question 1.B of the HW.
bins = [25, ... YOUR_CODE_HERE ... , 55]
df['age_binned'] = pd.cut(df['AGE'], bins=bins, include_lowest=True, right=False)
educ_bins = [0, 72, 73, YOUR_CODE_HERE , 900]
educ_bin_labels = ['Some_High_School', 'High_School_Diploma', 'Some_College',
'Bachelors_Degree', 'Beyond_Bachelors']
df['educ_binned'] = pd.cut(df['EDUC'], bins=educ_bins, labels=YOUR_CODE_HERE, include_lowest=True)
df = wage_growth_analytics.s20_bin_vars(df)
Step 21. Compute the average wages within each education and age group as defined above.#
Within each “cell”, compute the average wage. You will need to use the survey weights to compute each mean within each cell. The result will be average real wages (deflated with CPI99) computed over the cross-section and over time.
For a nicer display at the end, run a code cell with the code pd.DataFrame(group_means).
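A sketch of what this computation can look like is below (the exact implementation in wage_growth_analytics.s21_within_group_averages may differ):
group_means = (
    df
    .groupby(['age_binned', 'educ_binned'], observed=True)
    .apply(lambda g: np.average(g['real_wage'], weights=g['ASECWT']))
    .rename('average_wage')
)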
group_means = wage_growth_analytics.s21_within_group_averages(df)
group_means
age_binned educ_binned
[25, 30) Some_High_School 9.869
High_School_Diploma 12.180
Some_College 13.508
Bachelors_Degree 18.384
Beyond_Bachelors 21.619
[30, 35) Some_High_School 10.593
High_School_Diploma 13.535
Some_College 15.734
Bachelors_Degree 22.128
Beyond_Bachelors 26.842
[35, 40) Some_High_School 11.105
High_School_Diploma 14.877
Some_College 17.820
Bachelors_Degree 25.552
Beyond_Bachelors 31.572
[40, 45) Some_High_School 11.688
High_School_Diploma 15.601
Some_College 18.768
Bachelors_Degree 26.943
Beyond_Bachelors 33.668
[45, 50) Some_High_School 12.038
High_School_Diploma 16.029
Some_College 19.536
Bachelors_Degree 27.818
Beyond_Bachelors 34.648
[50, 55) Some_High_School 12.267
High_School_Diploma 16.575
Some_College 19.923
Bachelors_Degree 27.838
Beyond_Bachelors 34.816
Name: average_wage, dtype: float64
pd.DataFrame(group_means)
age_binned | educ_binned | average_wage
---|---|---
[25, 30) | Some_High_School | 9.869
 | High_School_Diploma | 12.180
 | Some_College | 13.508
 | Bachelors_Degree | 18.384
 | Beyond_Bachelors | 21.619
[30, 35) | Some_High_School | 10.593
 | High_School_Diploma | 13.535
 | Some_College | 15.734
 | Bachelors_Degree | 22.128
 | Beyond_Bachelors | 26.842
[35, 40) | Some_High_School | 11.105
 | High_School_Diploma | 14.877
 | Some_College | 17.820
 | Bachelors_Degree | 25.552
 | Beyond_Bachelors | 31.572
[40, 45) | Some_High_School | 11.688
 | High_School_Diploma | 15.601
 | Some_College | 18.768
 | Bachelors_Degree | 26.943
 | Beyond_Bachelors | 33.668
[45, 50) | Some_High_School | 12.038
 | High_School_Diploma | 16.029
 | Some_College | 19.536
 | Bachelors_Degree | 27.818
 | Beyond_Bachelors | 34.648
[50, 55) | Some_High_School | 12.267
 | High_School_Diploma | 16.575
 | Some_College | 19.923
 | Bachelors_Degree | 27.838
 | Beyond_Bachelors | 34.816
Step 22. Display average wages within each education and age group as a table.#
Use the group_means variable from before and use unstack. Be sure to rearrange the columns so that the education levels appear in the correct order, from lowest to highest education.
The end result should look like this:
a = group_means.unstack()
cols = ['Some_High_School', 'High_School_Diploma', 'Some_College', 'Bachelors_Degree', 'Beyond_Bachelors']
# I am doing this to reorder the columns in the way that I want.
a = a[cols]
a
age_binned | Some_High_School | High_School_Diploma | Some_College | Bachelors_Degree | Beyond_Bachelors
---|---|---|---|---|---
[25, 30) | 9.869 | 12.180 | 13.508 | 18.384 | 21.619 |
[30, 35) | 10.593 | 13.535 | 15.734 | 22.128 | 26.842 |
[35, 40) | 11.105 | 14.877 | 17.820 | 25.552 | 31.572 |
[40, 45) | 11.688 | 15.601 | 18.768 | 26.943 | 33.668 |
[45, 50) | 12.038 | 16.029 | 19.536 | 27.818 | 34.648 |
[50, 55) | 12.267 | 16.575 | 19.923 | 27.838 | 34.816 |
Step 23. Plot the previous table as a heatmap.#
Use seaborn's heatmap function. Just pass the table from before into sns.heatmap. Be sure to use plt.title to give the plot an appropriate title.
sns.heatmap(a)
plt.title('Average Wages, by group');

Step 24. Create the demographically adjusted series#
Now we will compute a demographically adjusted average wage series. To do this, average real wages within demographic cells defined by year, age bin, and education level. These inner_means will be weighted using the survey weights. Recall that the survey weights are given in the CPS data to tell us how to weight the data so that it is a statistically representative sample. We will also sum these survey weights within each age-education group, or “cell”, and freeze them at their year-2000 values. We will use these frozen weights to compute an adjusted time series for average wages for the rest of the years. Specifically, we will compute an average of the average group wages. (Note that an average of an average has a mathematically reasonable meaning.)
Use the starter code provided with the homework to complete this. After doing this, construct the variable adj_ave_wage_growth using .pct_change as before.
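Here is one sketch of the fixed-weight aggregation described above. The intermediate names (cell, inner_means, weights_2000) are my own, and the exact implementation in wage_growth_analytics.s24_demographically_adj_series may differ:
cell = ['YEAR', 'age_binned', 'educ_binned']

# Weighted average wage within each year-age-education cell
inner_means = (
    df
    .groupby(cell, observed=True)
    .apply(lambda g: np.average(g['real_wage'], weights=g['ASECWT']))
)

# Cell population shares, frozen at their year-2000 values
weights_2000 = (
    df[df['YEAR'].dt.year == 2000]
    .groupby(['age_binned', 'educ_binned'], observed=True)['ASECWT']
    .sum()
)
weights_2000 = weights_2000 / weights_2000.sum()

# Fixed-weight average of the cell averages by year, shifted back one year
# as before, since wages refer to the previous calendar year
adj_ave_wages = (
    inner_means
    .unstack(['age_binned', 'educ_binned'])
    .mul(weights_2000, axis=1)
    .sum(axis=1)
    .shift(-1)
)
tdf['adj_ave_wages'] = adj_ave_wages
tdf['adj_ave_wage_growth'] = tdf['adj_ave_wages'].pct_change(fill_method=None)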
tdf = wage_growth_analytics.s24_demographically_adj_series(df, tdf)
tdf.tail()
YEAR | ave_wages | median_wages | employment | ave_wage_growth | median_wage_growth | employment_growth | adj_ave_wages | adj_ave_wage_growth
---|---|---|---|---|---|---|---|---
2020-01-01 | 21.313 | 16.843 | 0.960 | 0.044 | 0.054 | -0.007 | 20.129 | 0.039 |
2021-01-01 | 20.763 | 16.262 | 0.956 | -0.026 | -0.034 | -0.004 | 19.555 | -0.029 |
2022-01-01 | 20.464 | 16.003 | 0.967 | -0.014 | -0.016 | 0.012 | 19.121 | -0.022 |
2023-01-01 | 20.542 | 15.779 | 0.968 | 0.004 | -0.014 | 0.000 | 19.167 | 0.002 |
2024-01-01 | NaN | NaN | 0.964 | NaN | NaN | -0.004 | NaN | NaN |
Step 25. Check your understanding of the usage of groupby above.#
Explain the following. Why did I use groupby(by=[]) in one place and groupby(level=...) in another place? Explain the difference between the by keyword and the level keyword in the groupby function. (HINT: Check the documentation and run the code line by line if it's unclear.)
# The difference is that by= groups on column values (or arrays), while level= groups on a level of the index (for example, a level of a MultiIndex).
Step 26. Plot the unadjusted and the demographically adjusted series for average wages on the same plot.#
Also, explain what you see.
ANSWER:
In the plot immediately below, we see that the demographically adjusted wage series (adj_ave_wages) coincides with the unadjusted series in the year 1999 (data from the 2000 survey). They diverge as time goes on. This is as expected, and serves as a check that we have performed our calculations correctly. The adjusted wage series is lower than the unadjusted one, which suggests that the composition of the labor force was tilting towards higher-earning demographics.
tdf[['ave_wages', 'adj_ave_wages']].plot()
plt.title("Demographically Adjusted Average Wages");

Step 27. Repeat the calculations from Step 19, but include the new adjusted series as well.#
What do you see? Is it still puzzling? (NOTE: This is mostly an open-ended question.)
ANSWER:
Now, if we repeat our calculations with regard to the time series correlation between wage growth and employment, we see that demographically adjusting wage growth lowers the correlation. The correlation between wage growth and employment growth becomes more negative. This is also a little puzzling.
tdf.corr()
 | ave_wages | median_wages | employment | ave_wage_growth | median_wage_growth | employment_growth | adj_ave_wages | adj_ave_wage_growth
---|---|---|---|---|---|---|---|---
ave_wages | 1.000 | 0.844 | -0.193 | 0.323 | 0.190 | -0.102 | 0.833 | 0.269 |
median_wages | 0.844 | 1.000 | -0.180 | 0.231 | 0.322 | -0.142 | 0.927 | 0.203 |
employment | -0.193 | -0.180 | 1.000 | 0.110 | -0.039 | 0.639 | -0.230 | 0.081 |
ave_wage_growth | 0.323 | 0.231 | 0.110 | 1.000 | 0.596 | -0.287 | 0.347 | 0.974 |
median_wage_growth | 0.190 | 0.322 | -0.039 | 0.596 | 1.000 | -0.260 | 0.192 | 0.573 |
employment_growth | -0.102 | -0.142 | 0.639 | -0.287 | -0.260 | 1.000 | -0.267 | -0.370 |
adj_ave_wages | 0.833 | 0.927 | -0.230 | 0.347 | 0.192 | -0.267 | 1.000 | 0.340 |
adj_ave_wage_growth | 0.269 | 0.203 | 0.081 | 0.974 | 0.573 | -0.370 | 0.340 | 1.000 |
5. Conclusion#
Note that there is more that we could do with this. For one, we could try to account for changes in employment. For example, the procedure would work like this:
We could impute wages for those not working. Your sample would then expand to all workers – even those without a measured wage. We can do the imputation in a crude way. Using this new sample, define demographic groups by age-skill for each year. For each person without a wage in a given year, match them to their demographic cell in that year. For your imputation, assign those with a missing wage in that year to the 33rd percentile of the wages within that cell (for that year). (Note: those with the highest wages will be at the 99th percentile.) This assumes that those with a missing wage (of a given age-skill type) are disproportionately drawn from the bottom part of their age-skill wage distribution. Still adjust the wages for changing demographic composition by fixing the weights for each demographic cell at their 2000 level (using the full sample including the zeros).
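As a sketch of the imputation idea (hypothetical names: df_all is the expanded sample that also includes men without a measured wage):
# 33rd percentile of observed wages within each year-age-education cell
p33 = (
    df_all[df_all['real_wage'] > 0]
    .groupby(['YEAR', 'age_binned', 'educ_binned'], observed=True)['real_wage']
    .quantile(0.33)
)

# Assign that percentile to everyone in the cell with no measured wage
mask = df_all['real_wage'].isna() | (df_all['real_wage'] == 0)
keys = pd.MultiIndex.from_frame(df_all.loc[mask, ['YEAR', 'age_binned', 'educ_binned']])
df_all.loc[mask, 'real_wage'] = p33.reindex(keys).to_numpy()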
We could then ask: how do the time series patterns in these demographically adjusted wages (including the zeros) compare to the wage series created previously? How would things change if you used the 50th percentile to conduct the imputation?
In addition, we could also repeat this exercise using data from the 2000 Census and the 2001-2014 American Community Survey (downloaded from the IPUMS USA website—a different data set from the IPUMS CPS data set that we used here).
All of these things could help us to better understand this puzzling fact that we see in the data.