1. Common HPC Setup

Before installing individual ML tools, ensure your HPC environment is properly configured. This guide covers the foundational setup that all subsequent modules depend on.

Resource Requirements Overview

Most ML protein tools share similar computational requirements. Here’s a general guide:

Resource   | Minimum | Recommended | Notes
-----------|---------|-------------|------------------------------------
GPU RAM    | 16 GB   | 40+ GB      | A100 80GB ideal for large proteins
CPU RAM    | 32 GB   | 64 GB       | More for MSA generation
Disk Space | 50 GB   | 200+ GB     | Model weights + databases
CUDA       | 11.6+   | 12.1+       | Check tool-specific requirements

Checking Your HPC Environment


Tip: Internet Access on HPC

Many HPC clusters do not have internet access on compute nodes (the nodes where your heavy jobs run). They often only have internet on “login” or “head” nodes.

  • Downloads: Always run installation and download commands on a login node.
  • Execution: When running jobs, ensure your tools don’t try to download models on the fly. Pre-download all weights and databases.
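For example, you can stage weights into scratch from a login node before submitting any jobs (the URL and paths below are placeholders; use the actual download location from your tool’s documentation):

# Run this on a login node, where internet access is available
mkdir -p /scratch/$USER/weights
cd /scratch/$USER/weights
wget https://example.org/model_weights.tar.gz   # placeholder URL
tar -xzf model_weights.tar.gz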

1. Check Available CUDA Modules

module avail cuda

This shows all CUDA versions installed on your cluster. Note the versions; you’ll need to match them to tool requirements.
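Output varies by cluster; a typical listing looks something like this (module paths and versions are illustrative):

-------------------- /apps/modulefiles --------------------
cuda/11.8    cuda/12.1    cuda/12.4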

2. Check GPU Availability

Request an interactive GPU session:

# SLURM example
srun --gpus=1 --pty bash

Then check GPU status:

nvidia-smi

This shows:

  • GPU model (A100, V100, RTX 4090, etc.)
  • GPU memory (important for large models)
  • Current CUDA driver version
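For a compact report of exactly these fields, nvidia-smi’s query mode is convenient:

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv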

3. Check CUDA Toolkit Version

nvcc --version

If this fails, load a CUDA module first:

module load cuda/12.1
nvcc --version

Conda/Mamba Setup


Most tools use Conda environments. Mamba is recommended as it’s significantly faster than Conda for dependency resolution.

Installing Mamba (if not available)

If your HPC doesn’t provide Mamba, install Miniforge, which includes both Conda and Mamba:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh

Follow prompts, then restart your shell or run:

source ~/.bashrc
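If your home directory quota is tight, the installer can also run non-interactively into scratch (-b accepts the license without prompting, -p sets the install prefix; the path here is an example):

bash Miniforge3-Linux-x86_64.sh -b -p /scratch/$USER/miniforge3
source /scratch/$USER/miniforge3/etc/profile.d/conda.sh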

Best Practices for HPC Conda Usage

  1. Use dedicated environment directories: Set environment location to avoid filling home directory quota:
# Add to ~/.condarc
envs_dirs:
  - /scratch/$USER/conda_envs
pkgs_dirs:
  - /scratch/$USER/conda_pkgs
  2. One environment per tool: Don’t try to install all tools in one environment; dependency conflicts are common.

  3. Export environments for reproducibility:

mamba env export > environment.yml
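The exported file can then rebuild the same environment later or on another cluster:

mamba env create -f environment.yml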

Docker vs Singularity/Apptainer


IMPORTANT: Most academic HPCs do NOT support Docker for security reasons. Use Singularity or Apptainer instead.

Loading Container Runtime

module load apptainer
# or on older systems:
module load singularity
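Verify the runtime is on your PATH before going further:

apptainer --version
# or: singularity --version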

Converting Docker Commands to Apptainer

Many tool READMEs show Docker commands. Here’s how to translate them:

Docker Command             | Apptainer Equivalent
---------------------------|-----------------------------------
docker run                 | apptainer run
docker run --gpus all      | apptainer run --nv
docker run -v /path:/path  | apptainer run --bind /path:/path
docker pull image:tag      | apptainer pull docker://image:tag

Example conversion:

# Docker (won't work on HPC):
docker run --gpus all -v $(pwd):/workspace myimage:latest python script.py

# Apptainer (works on HPC):
apptainer run --nv --bind $(pwd):/workspace myimage.sif python script.py

Pulling Docker Images as Singularity Files

apptainer pull docker://nvcr.io/nvidia/pytorch:23.10-py3
# Creates: pytorch_23.10-py3.sif
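A quick sanity check that the image runs and can see the GPU (assuming a PyTorch image like the one above, on a node with a GPU allocated):

apptainer exec --nv pytorch_23.10-py3.sif \
    python -c "import torch; print(torch.cuda.is_available())"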

SLURM Job Submission Basics


Most HPCs use SLURM for job scheduling. Here’s a template for ML jobs:

#!/bin/bash
#SBATCH --job-name=my_ml_job
#SBATCH --partition=gpu          # GPU partition name (varies by cluster)
#SBATCH --gpus=1                  # Number of GPUs
#SBATCH --cpus-per-task=8         # CPUs for data loading
#SBATCH --mem=64G                 # RAM
#SBATCH --time=04:00:00           # Wall time (HH:MM:SS)
#SBATCH --output=%x_%j.out        # Output file (%x=job name, %j=job ID)
#SBATCH --error=%x_%j.err         # Error file

# Load required modules
module load cuda/12.1
module load apptainer

# Activate conda environment
# (sourcing ~/.bashrc often fails in non-interactive SLURM jobs,
# so source conda.sh from your installation directly)
source /path/to/your/miniconda3/etc/profile.d/conda.sh
mamba activate my_env

# Run your command
python my_script.py
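Save the template as, say, gpu_job.sh (the name is arbitrary) and submit it:

sbatch gpu_job.sh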

Common SLURM Commands

Command          | Description
-----------------|----------------------
sbatch script.sh | Submit job
squeue -u $USER  | Check your jobs
scancel JOB_ID   | Cancel a job
sinfo            | Show partition info
sacct -j JOB_ID  | Job accounting info

GPU Partition Names

GPU partition names vary by cluster. Common names:

  • gpu, gpus, gpu-shared
  • a100, v100, rtx
  • gpu-debug (for testing)

Check your cluster’s documentation or run sinfo to see available partitions.
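One way to spot GPU partitions is to list each partition’s generic resources (GRES), where GPUs appear:

sinfo -o "%P %G"   # partition name and GRES, e.g. gpu:a100:4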

Environment Variables


Several tools use environment variables. Add these to your ~/.bashrc:

# Model weight storage (prevents filling home directory)
export TORCH_HOME=/scratch/$USER/torch_cache
export HF_HOME=/scratch/$USER/huggingface_cache
export TRANSFORMERS_CACHE=/scratch/$USER/transformers_cache

# ColabFold databases
export COLABFOLD_DOWNLOAD_DIR=/scratch/$USER/colabfold_db

# Chai-1 models
export CHAI_DOWNLOADS_DIR=/scratch/$USER/chai_models

# General cache
export XDG_CACHE_HOME=/scratch/$USER/.cache

Replace /scratch/$USER with your cluster’s scratch or work directory path.
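Some tools will not create these directories themselves, so create them once up front:

mkdir -p /scratch/$USER/{torch_cache,huggingface_cache,transformers_cache,colabfold_db,chai_models,.cache}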

Verifying GPU Works with PyTorch


After setting up an environment with PyTorch, verify GPU access:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    # Quick computation test
    x = torch.randn(1000, 1000, device='cuda')
    y = torch.matmul(x, x)
    print("GPU computation test: PASSED")

Save as test_gpu.py and run:

python test_gpu.py

Expected output (example):

PyTorch version: 2.1.0
CUDA available: True
CUDA version: 12.1
GPU count: 1
GPU name: NVIDIA A100-SXM4-80GB
GPU memory: 84.9 GB
GPU computation test: PASSED

Understanding GPU Memory Requirements

Different tasks require different GPU memory:

Task                                          | Typical GPU Memory
----------------------------------------------|-------------------
Structure prediction (small protein, <200 aa) | 8-16 GB
Structure prediction (large protein, >500 aa) | 32-80 GB
Protein design (RFdiffusion2)                 | 16-32 GB
Docking (DiffDock-PP, PLACER)                 | 8-16 GB
Language model inference (ESM3)               | 16-40 GB
Binder design (BindCraft)                     | 32-80 GB

If you get out-of-memory errors:

  1. Request a GPU with more memory
  2. Reduce batch size or sequence length
  3. Use CPU offloading if available
  4. Process sequences in chunks
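For the first option, many SLURM clusters let you request a specific GPU type directly in the --gpus flag (type names are cluster-specific; check sinfo or your cluster’s docs):

#SBATCH --gpus=a100:1    # one A100; replace "a100" with your cluster's type name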

Troubleshooting Common Issues

Warning: Common Issues

“CUDA out of memory”

  • Request more GPU memory
  • Reduce batch size
  • Use gradient checkpointing if training

“No CUDA runtime found”

module load cuda/12.1  # Load CUDA module
nvcc --version         # Verify it loaded

“Singularity: command not found”

module load apptainer  # or: module load singularity

Conda environment activation fails in SLURM

Add to your job script:

# Source conda.sh directly (adjust path to your installation)
source /path/to/miniforge3/etc/profile.d/conda.sh
mamba activate my_env

Permission denied on container

chmod +x container.sif