Week 8: Medium-Sized Data and Remote Machines

Agenda

  • Medium-sized data strategies and Polars

  • Remote machines and HPC

  • Go over Homework 5: running the Clean TRACE pipeline on RCC

Learning Outcomes

  • Understand strategies for working with medium-sized datasets (1 GB to 100 GB)

  • Compare Pandas and Polars for data processing at scale

  • Understand lazy evaluation, predicate pushdown, streaming, and Hive partitioning

  • Become familiar with TRACE corporate bond data

  • Connect to remote machines via SSH and transfer files with rsync

  • Understand HPC cluster architecture (login nodes, compute nodes, storage)

  • Submit and manage jobs with SLURM (sinteractive, sbatch)

  • Understand why data pipelines must decouple internet-dependent pulls from processing

  • Set up SSH port forwarding to access Jupyter notebooks on remote compute nodes