Week 8: Medium-Sized Data and Remote Machines

Understand strategies for working with medium-sized datasets (1GB-100GB)
Compare Pandas and Polars for data processing at scale
Understand lazy evaluation, predicate pushdown, streaming, and Hive partitioning
Get an introduction to TRACE corporate bond data
Connect to remote machines via SSH and transfer files with rsync
Understand HPC cluster architecture (login nodes, compute nodes, storage)
Submit and manage jobs with SLURM (sinteractive, sbatch)
Understand why data pipelines must decouple internet-dependent pulls from processing
Set up SSH port forwarding to access Jupyter notebooks on remote compute nodes
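
As a preview of two of the ideas listed above, Hive partitioning and predicate pushdown, here is a minimal sketch using only the Python standard library. It is an illustration of the concept, not how Polars or any other engine implements it. The helper names (`write_partitioned`, `scan`), the column names, and the toy trade records are all hypothetical. The point is that when the partition key is encoded in the directory name (`year=2022/`), a filter on that key can skip whole directories without opening their files.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(root: Path, rows: list[dict], key: str) -> None:
    """Write rows into Hive-style directories: root/key=value/part-0.csv."""
    by_value: dict[str, list[dict]] = {}
    for row in rows:
        by_value.setdefault(str(row[key]), []).append(row)
    for value, group in by_value.items():
        part_dir = root / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition key lives in the directory name, not in the file.
        fields = [name for name in group[0] if name != key]
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def scan(root: Path, key: str, wanted: set[str]) -> tuple[list[dict], int]:
    """Read only the partitions whose key value is in `wanted`.

    Returns the matching rows and the number of files actually opened.
    """
    rows, files_opened = [], 0
    for part_dir in sorted(root.glob(f"{key}=*")):
        value = part_dir.name.split("=", 1)[1]
        if value not in wanted:
            continue  # pruned: this partition is never read from disk
        for path in sorted(part_dir.glob("*.csv")):
            files_opened += 1
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    row[key] = value  # restore the key from the path
                    rows.append(row)
    return rows, files_opened


# Hypothetical toy data standing in for bond trades.
trades = [
    {"year": 2021, "bond_id": "A", "price": "99.5"},
    {"year": 2022, "bond_id": "A", "price": "98.0"},
    {"year": 2022, "bond_id": "B", "price": "101.2"},
    {"year": 2023, "bond_id": "B", "price": "100.4"},
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "trades"
    write_partitioned(root, trades, key="year")
    result, files_opened = scan(root, key="year", wanted={"2022"})

print(len(result), files_opened)  # prints: 2 1
```

Three partitions exist on disk, but the filter on `year` opens only one file. With datasets in the 1GB-100GB range, skipping files this way can matter more than any speedup in how each individual file is parsed.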