# Week 8: Medium-Sized Data and Remote Machines

## Agenda
- Medium-sized data strategies and Polars
- Remote machines and HPC
- Go over Homework 5: running the Clean TRACE pipeline on RCC
## Learning Outcomes
- Understand strategies for working with medium-sized datasets (roughly 1–100 GB)
- Compare Pandas and Polars for data processing at scale
- Understand lazy evaluation, predicate pushdown, streaming, and Hive partitioning
- Become familiar with TRACE corporate bond transaction data
- Connect to remote machines via SSH and transfer files with rsync
- Understand HPC cluster architecture (login nodes, compute nodes, storage)
- Submit and manage jobs with SLURM (sinteractive, sbatch)
- Understand why data pipelines must decouple internet-dependent data pulls from processing, since compute nodes typically lack internet access
- Set up SSH port forwarding to access Jupyter notebooks running on remote compute nodes