1. Common HPC Setup

Before installing individual ML tools, ensure your HPC environment is properly configured. This guide covers the foundational setup that all subsequent modules depend on.

Resource Requirements Overview

Most ML protein tools share similar computational requirements. Here’s a general guide:

Resource   | Minimum | Recommended | Notes
-----------|---------|-------------|------------------------------------
GPU RAM    | 16 GB   | 40+ GB      | A100 80GB ideal for large proteins
CPU RAM    | 32 GB   | 64 GB       | More for MSA generation
Disk Space | 50 GB   | 200+ GB     | Model weights + databases
CUDA       | 11.6+   | 12.1+       | Check tool-specific requirements

Checking Your HPC Environment


Tip: Internet Access on HPC

Many HPC clusters do not have internet access on compute nodes (the nodes where your heavy jobs run). They often only have internet on “login” or “head” nodes.

  • Downloads: Always run installation and download commands on a login node.
  • Execution: When running jobs, ensure your tools don’t try to download models on the fly. Pre-download all weights and databases.
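For example, you can stage weights into scratch from a login node before submitting any jobs (the URL and paths below are placeholders; use the actual download location from your tool’s documentation):

# Run this on a login node, where internet access is available
mkdir -p /scratch/$USER/weights
cd /scratch/$USER/weights
wget https://example.org/model_weights.tar.gz   # placeholder URL
tar -xzf model_weights.tar.gz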

1. Check Available CUDA Modules

module avail cuda

This shows all CUDA versions installed on your cluster. Note the versions; you’ll need to match them to tool requirements.
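Output varies by cluster; a typical listing looks something like this (module paths and versions are illustrative):

-------------------- /apps/modulefiles --------------------
cuda/11.8    cuda/12.1    cuda/12.4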

2. Check GPU Availability

Request an interactive GPU session:

# SLURM example
srun --gpus=1 --pty bash

Then check GPU status:

nvidia-smi

This shows:

  • GPU model (A100, V100, RTX 4090, etc.)
  • GPU memory (important for large models)
  • Current CUDA driver version
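For a compact report of exactly these fields, nvidia-smi’s query mode is convenient:

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv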

3. Check CUDA Toolkit Version

nvcc --version

If this fails, load a CUDA module first:

module load cuda/12.1
nvcc --version

Conda/Mamba Setup


Most tools use Conda environments. Mamba is recommended as it’s significantly faster than Conda for dependency resolution.

Installing Mamba (if not available)

If your HPC doesn’t provide Mamba, install Miniforge, which includes both Conda and Mamba:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh

Follow prompts, then restart your shell or run:

source ~/.bashrc
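If your home directory quota is tight, the installer can also run non-interactively into scratch (-b accepts the license without prompting, -p sets the install prefix; the path here is an example):

bash Miniforge3-Linux-x86_64.sh -b -p /scratch/$USER/miniforge3
source /scratch/$USER/miniforge3/etc/profile.d/conda.sh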

Best Practices for HPC Conda Usage

  1. Use dedicated environment directories: Set environment location to avoid filling home directory quota:
# Add to ~/.condarc
envs_dirs:
  - /scratch/$USER/conda_envs
pkgs_dirs:
  - /scratch/$USER/conda_pkgs
  2. One environment per tool: Don’t try to install all tools in one environment; dependency conflicts are common.

  3. Export environments for reproducibility:

mamba env export > environment.yml
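The exported file can then rebuild the same environment later or on another cluster:

mamba env create -f environment.yml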

Docker vs Singularity/Apptainer


IMPORTANT: Most academic HPCs do NOT support Docker for security reasons. Use Singularity or Apptainer instead.

Loading Container Runtime

module load apptainer
# or on older systems:
module load singularity
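Verify the runtime is on your PATH before going further:

apptainer --version
# or: singularity --version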

Converting Docker Commands to Apptainer

Many tool READMEs show Docker commands. Here’s how to translate them:

Docker Command             | Apptainer Equivalent
---------------------------|-----------------------------------
docker run                 | apptainer run
docker run --gpus all      | apptainer run --nv
docker run -v /path:/path  | apptainer run --bind /path:/path
docker pull image:tag      | apptainer pull docker://image:tag

Example conversion:

# Docker (won't work on HPC):
docker run --gpus all -v $(pwd):/workspace myimage:latest python script.py

# Apptainer (works on HPC):
apptainer run --nv --bind $(pwd):/workspace myimage.sif python script.py

Pulling Docker Images as Singularity Files

apptainer pull docker://nvcr.io/nvidia/pytorch:23.10-py3
# Creates: pytorch_23.10-py3.sif
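A quick sanity check that the image runs and can see the GPU (assuming a PyTorch image like the one above, on a node with a GPU allocated):

apptainer exec --nv pytorch_23.10-py3.sif \
    python -c "import torch; print(torch.cuda.is_available())"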

SLURM Job Submission Basics


Most HPCs use SLURM for job scheduling. Here’s a template for ML jobs:

#!/bin/bash
#SBATCH --job-name=my_ml_job
#SBATCH --partition=gpu          # GPU partition name (varies by cluster)
#SBATCH --gpus=1                  # Number of GPUs
#SBATCH --cpus-per-task=8         # CPUs for data loading
#SBATCH --mem=64G                 # RAM
#SBATCH --time=04:00:00           # Wall time (HH:MM:SS)
#SBATCH --output=%x_%j.out        # Output file (%x=job name, %j=job ID)
#SBATCH --error=%x_%j.err         # Error file

# Load required modules
module load cuda/12.1
module load apptainer

# Activate conda environment
# (sourcing ~/.bashrc often fails in non-interactive SLURM jobs,
# so source conda.sh from your installation directly)
source /path/to/your/miniconda3/etc/profile.d/conda.sh
mamba activate my_env

# Run your command
python my_script.py
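Save the template as, say, gpu_job.sh (the name is arbitrary) and submit it:

sbatch gpu_job.sh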

Common SLURM Commands

Command          | Description
-----------------|----------------------
sbatch script.sh | Submit job
squeue -u $USER  | Check your jobs
scancel JOB_ID   | Cancel a job
sinfo            | Show partition info
sacct -j JOB_ID  | Job accounting info

GPU Partition Names

GPU partition names vary by cluster. Common names:

  • gpu, gpus, gpu-shared
  • a100, v100, rtx
  • gpu-debug (for testing)

Check your cluster’s documentation or run sinfo to see available partitions.
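One way to spot GPU partitions is to list each partition’s generic resources (GRES), where GPUs appear:

sinfo -o "%P %G"   # partition name and GRES, e.g. gpu:a100:4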

Environment Variables


Several tools use environment variables. Add these to your ~/.bashrc:

# Model weight storage (prevents filling home directory)
export TORCH_HOME=/scratch/$USER/torch_cache
export HF_HOME=/scratch/$USER/huggingface_cache
export TRANSFORMERS_CACHE=/scratch/$USER/transformers_cache

# ColabFold databases
export COLABFOLD_DOWNLOAD_DIR=/scratch/$USER/colabfold_db

# Chai-1 models
export CHAI_DOWNLOADS_DIR=/scratch/$USER/chai_models

# General cache
export XDG_CACHE_HOME=/scratch/$USER/.cache

Replace /scratch/$USER with your cluster’s scratch or work directory path.
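Some tools will not create these directories themselves, so create them once up front:

mkdir -p /scratch/$USER/{torch_cache,huggingface_cache,transformers_cache,colabfold_db,chai_models,.cache}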

Verifying GPU Works with PyTorch


After setting up an environment with PyTorch, verify GPU access:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    # Quick computation test
    x = torch.randn(1000, 1000, device='cuda')
    y = torch.matmul(x, x)
    print("GPU computation test: PASSED")

Save as test_gpu.py and run:

python test_gpu.py

Expected output (example):

PyTorch version: 2.1.0
CUDA available: True
CUDA version: 12.1
GPU count: 1
GPU name: NVIDIA A100-SXM4-80GB
GPU memory: 84.9 GB
GPU computation test: PASSED

Understanding GPU Memory Requirements

Different tasks require different GPU memory:

Task                                          | Typical GPU Memory
----------------------------------------------|-------------------
Structure prediction (small protein, <200 aa) | 8-16 GB
Structure prediction (large protein, >500 aa) | 32-80 GB
Protein design (RFdiffusion2)                 | 16-32 GB
Docking (DiffDock-PP, PLACER)                 | 8-16 GB
Language model inference (ESM3)               | 16-40 GB
Binder design (BindCraft)                     | 32-80 GB

If you get out-of-memory errors:

  1. Request a GPU with more memory
  2. Reduce batch size or sequence length
  3. Use CPU offloading if available
  4. Process sequences in chunks
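For the first option, many SLURM clusters let you request a specific GPU type directly in the --gpus flag (type names are cluster-specific; check sinfo or your cluster’s docs):

#SBATCH --gpus=a100:1    # one A100; replace "a100" with your cluster's type name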

Troubleshooting Common Issues

Warning: Common Issues

“CUDA out of memory”

  • Request more GPU memory
  • Reduce batch size
  • Use gradient checkpointing if training

“No CUDA runtime found”

module load cuda/12.1  # Load CUDA module
nvcc --version         # Verify it loaded

“Singularity: command not found”

module load apptainer  # or: module load singularity

Conda environment activation fails in SLURM

Add to your job script:

# Source conda.sh directly (adjust path to your installation)
source /path/to/miniforge3/etc/profile.d/conda.sh
mamba activate my_env

Permission denied on container

chmod +x container.sif