Homework 5#

Learning Outcomes#

  • Connect to UChicago’s Midway3 HPC cluster via SSH

  • Set up a working directory and conda environment on a shared cluster

  • Run a real corporate bond data pipeline (Clean TRACE) using SLURM batch jobs

  • Use rsync to copy pre-pulled data from the instructor’s project directory

  • Use scode to launch a remote VS Code instance on a compute node

  • Use SSH port forwarding to access remote services from your laptop

Assignment#

The assignment is located at the following GitHub Classroom link: https://classroom.github.com/a/xczDCvT_

Due Date: Mar 8, 2026 at 6 pm CT

This is a lightweight assignment designed to give you hands-on experience with the RCC Midway3 cluster. You will set up your environment, run the Clean TRACE data pipeline as a batch job, and launch a remote VS Code session. This is intentionally light so that you can focus on your final projects.


Part 1: Set Up Your Working Directory on RCC#

Step 1: SSH into Midway3#

Open a terminal on your laptop and connect to Midway3:

ssh <your-cnetid>@midway3.rcc.uchicago.edu

You will be prompted for your password, followed by Duo multi-factor authentication (MFA). After completing MFA, you will land on a login node.

Login nodes are shared

Login nodes are shared by all users. They are intended for lightweight tasks: editing files, submitting jobs, transferring data. Do not run computationally intensive work on login nodes. Use the job scheduler (SLURM) to request dedicated compute resources instead.

Step 2: Create your personal directory in project space#

Your home directory (/home/$USER) has a small quota (~30 GB) and is not suitable for large datasets or conda environments. Instead, use the shared project directory for this course:

Location

Path

Quota

Backed Up

Purpose

Home

/home/<cnetid>

30 GB

Yes

Config files, small scripts

Project

/project/finm32900

Large (shared)

Yes

Shared course data, virtual environments

Scratch

/scratch/midway3/<cnetid>

Large

No

Temporary large files, intermediate results

Create a personal directory in the project space:

mkdir -p /project/finm32900/${USER}

Step 3: Clone the Clean TRACE repository#

Clone the case study repository into your project directory:

cd /project/finm32900/${USER}
git clone <repo-url> case_study_clean_trace
cd case_study_clean_trace

Tip

Always work in /project/finm32900/${USER}/ for this course, not in /home/. The home directory quota fills up quickly with Python environments and data files.


Part 2: Set Up the Conda Environment#

Step 1: Load the Anaconda module#

Midway uses a module system to manage software. Load the Anaconda module:

module load python/anaconda-2024.10

You can see all available modules with:

module avail

Important

Do not run conda init on RCC. It modifies your shell profile in ways that conflict with the module system. Always use source activate instead of conda activate.

Step 2: Create a conda environment in project space#

Create an isolated conda environment in your project directory (not in /home/, to avoid quota issues):

conda create --prefix=/project/finm32900/${USER}/envs/clean_trace python=3.11 -y

Activate the environment and install dependencies:

source activate /project/finm32900/${USER}/envs/clean_trace
pip install -r requirements.txt

To reactivate this environment in future sessions:

module load python/anaconda-2024.10
source activate /project/finm32900/${USER}/envs/clean_trace

Step 3: Configure the environment file#

cp .env.example .env

Edit .env and set your WRDS username. Use the default small sample date range (January–February 2024) to keep the run quick:

WRDS_USERNAME="your_wrds_username"

Tip

By default, the pipeline processes only a 2-month sample (January–February 2024). This is intentional—it lets you run end-to-end quickly without processing 20+ years of data. Do not uncomment START_DATE or END_DATE unless you want a much longer run.

Step 4: Configure WRDS .pgpass#

If you haven’t already set up passwordless WRDS authentication, run the following on the login node (which has internet access):

python -c "import wrds; db = wrds.Connection(); db.close()"

This will prompt for your WRDS username and password and create the ~/.pgpass file.


Part 3: Run the Clean TRACE Pipeline#

Understanding the workflow#

Worker nodes on Midway3 do not have internet access, so data pulls (WRDS, Fama-French, etc.) cannot run inside a batch job. Instead, the instructor has already pulled the data once on the head node. The sbatch script will rsync this pre-pulled data into your project directory and then run the processing pipeline.

The instructor’s pre-pulled data is located at:

/project/finm32900/case_study_clean_trace/_data/pulled/

The sbatch script#

The repository includes a run-pipeline.sbatch script that automates everything. Here is the script with annotations:

#!/bin/bash
#SBATCH --job-name=clean-trace-pipeline
#SBATCH --output=%j_clean-trace-pipeline.out
#SBATCH --error=%j_clean-trace-pipeline.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=8000
#SBATCH --time=8:00:00
#SBATCH --account=finm32900

set -euo pipefail

# --- Configuration ---
PROJECT_BASE="/project/finm32900"
WORK_DIR="${PROJECT_BASE}/${USER}/case_study_clean_trace"
INSTRUCTOR_DATA="${PROJECT_BASE}/case_study_clean_trace/_data/pulled"
STUDENT_DATA="${WORK_DIR}/_data/pulled"

# --- Change to working directory ---
cd "${WORK_DIR}"
echo "[$(date)] Working directory: ${WORK_DIR}"

# --- Activate conda environment ---
module load python/anaconda-2024.10
source activate "${PROJECT_BASE}/${USER}/envs/clean_trace"
echo "[$(date)] Activated conda environment"

# --- Install/update dependencies ---
pip install --quiet -r requirements.txt
echo "[$(date)] Dependencies up to date"

# --- Rsync pulled data from instructor's project directory ---
echo "[$(date)] Syncing pulled data from instructor directory..."
mkdir -p "${STUDENT_DATA}"
rsync -a --info=progress2 "${INSTRUCTOR_DATA}/" "${STUDENT_DATA}/"
echo "[$(date)] Data sync complete"

# --- Tell doit to skip all pull tasks (data already synced) ---
echo "[$(date)] Marking pull tasks as ignored..."
doit ignore pull:fama_french pull:fisd pull:liu_wu pull:osbap \
    pull:trace_144a pull:trace_enhanced pull:trace_standard
echo "[$(date)] Pull tasks ignored"

# --- Run the full pipeline ---
echo "[$(date)] Starting pipeline..."
doit
echo "[$(date)] Pipeline complete"

Key details about this script:

Directive

Value

Meaning

--job-name

clean-trace-pipeline

Human-readable name for your job

--output / --error

%j_clean-trace-pipeline.out/.err

Stdout/stderr files (%j = job ID)

--nodes

1

Use a single compute node

--cpus-per-task

16

Request 16 CPU cores

--mem-per-cpu

8000

8 GB per CPU (128 GB total)

--time

8:00:00

Maximum wall time of 8 hours

--account

finm32900

Course billing account

The script does the following:

  1. Rsync the instructor’s pre-pulled data into your _data/pulled/ directory

  2. Mark pull tasks as ignored so that doit does not attempt to download data (compute nodes have no internet)

  3. Run the full pipeline with doit, which builds the FISD universe, filters TRACE, runs Stage 0 (Dick-Nielsen cleaning), and runs Stage 1 (bond analytics)

Submit the batch job#

sbatch run-pipeline.sbatch

You should see output like:

Submitted batch job 12345678

Monitor your job#

Use these commands to check on your job:

# Check job status
squeue -u ${USER}

# View all partitions and node availability
sinfo

# Check your account balance (compute allocation)
accounts balance

# Follow the job's stdout in real time
tail -f <jobid>_clean-trace-pipeline.out

# Follow stderr
tail -f <jobid>_clean-trace-pipeline.err

# Cancel a job if needed
scancel <jobid>

The pipeline processes 20+ years of TRACE corporate bond data through multiple stages. Depending on queue wait times and system load, the full run can take several hours.


Part 4: Launch VS Code on Midway with scode#

RCC provides a tool called scode that launches a VS Code web server on a compute node, which you can then connect to from your laptop’s browser via SSH port forwarding.

Step 1: Load the required modules#

On the Midway3 login node, load scode and the Anaconda module:

module load scode python/anaconda-2024.10

Step 2: Launch the VS Code server#

scode serve-web -- --account finm32900 --time 05:00:00 --mem 16G

This submits a SLURM job that starts a VS Code web server on a compute node. You should see output similar to:

Removing existing symlink: /home/jbejarano/.vscode/cli/serve-web/072586267e68ece9a47aa43f8c108e0dcbf44622.
Linked /home/jbejarano/.scode/versions/stable/1.109.5/vscode-server-linux-x64-web to /home/jbejarano/.vscode/cli/serve-web/072586267e68ece9a47aa43f8c108e0dcbf44622
Submitting SBATCH to serve VSCode environment.

Submitted batch job 45863567

sbatch: Verify job submission ...
sbatch: Using a shared partition ...
sbatch: Partition: caslake
sbatch: QOS-Flag: caslake
sbatch: Account: finm32900
sbatch: Verification: ***PASSED***

SBATCH job /home/jbejarano/.scode/sbatches/scode-web_20260223_105124.sbatch submitted successfully.
Output will be directed to /home/jbejarano/.scode/logs/scode-web_45863567.out.
Errors will be directed to /home/jbejarano/.scode/logs/scode-web_45863567.err.

VSCode server is starting with Slurm Job ID 45863567.
Use `scode jobs status 45863567` to get server status and connection info.
Use `scancel 45863567` to cancel the server job.

VSCode job 45863567 is running on 1 nodes: midway3-0067
Primary node: 10.50.250.67
Environment: /home/jbejarano/.scode/envs/stable/default

To connect to the VSCode Web GUI you need to create an SSH tunnel from your
local machine to the primary node above. This can be done with the following
command to be run on your local machine (e.g., PowerShell in Windows):

    ssh -L 8000:10.50.250.67:58992 jbejarano@midway3.rcc.uchicago.edu

Once the tunnel is created, you may access the VSCode Web GUI by entering
the following address in your browser:

    http://localhost:8000/?tkn=SOME_LONG_TOKEN

Server outputs are being written to /home/jbejarano/.scode/logs/scode-web_45863567.out.
Server errors are being written to /home/jbejarano/.scode/logs/scode-web_45863567.err.
Use `squeue -j 45863567` to check the status of the job.
Use `scancel 45863567` to cancel the job.

Step 3: Set up SSH port forwarding#

On your local machine (your laptop), run the SSH tunnel command shown in the scode output. Add the -N flag so that SSH only does port forwarding and does not open a remote shell:

ssh -N -L 8000:<compute_node_ip>:<remote_port> <your-cnetid>@midway3.rcc.uchicago.edu

For example, using the output above:

ssh -N -L 8000:10.50.250.67:58992 jbejarano@midway3.rcc.uchicago.edu

You can optionally add the -f flag to run the tunnel in the background.

Step 4: Connect via your browser#

Open a browser and navigate to the URL shown in the scode output:

http://localhost:8000/?tkn=SOME_LONG_TOKEN

You should see the VS Code editor running in your browser, connected to the Midway3 compute node. You can edit files, open terminals, and work as if you were using VS Code locally — but all computation happens on the cluster.

Step 5: Managing the scode session#

# Check the status of your scode server
scode jobs status <jobid>

# Check the job in the SLURM queue
squeue -j <jobid>

# Cancel the session when you're done
scancel <jobid>

Always clean up

When you are done working, cancel the scode job with scancel. Idle jobs still consume your account’s compute allocation. Don’t leave sessions running overnight.


Part 5: Deliverables#

Submit the following screenshots to demonstrate that you completed the assignment:

  1. Pipeline completion: A screenshot showing that the pipeline ran successfully in your own directory at /project/finm32900/$USER/case_study_clean_trace/. This could be the final lines of your job’s .out file showing "Pipeline complete", or a listing of the output files in _data/stage1/.

  2. VS Code via scode: A screenshot of VS Code running in your browser via scode, showing that you are connected to a Midway3 compute node.

Tip

You can use tail to show the end of your job output file:

tail -20 <jobid>_clean-trace-pipeline.out

Or list the pipeline outputs:

ls -lh /project/finm32900/${USER}/case_study_clean_trace/_data/stage1/