1. Common HPC Setup
Before installing individual ML tools, ensure your HPC environment is properly configured. This guide covers the foundational setup that all subsequent modules depend on.
Resource Requirements Overview
Most ML protein tools share similar computational requirements. Here’s a general guide:
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU RAM | 16 GB | 40+ GB | A100 80GB ideal for large proteins |
| CPU RAM | 32 GB | 64 GB | More for MSA generation |
| Disk Space | 50 GB | 200+ GB | Model weights + databases |
| CUDA | 11.6+ | 12.1+ | Check tool-specific requirements |
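Before downloading weights and databases, confirm the target filesystem actually has room. A quick check (paths vary by cluster, and quota is only available if your site enforces user quotas):
df -h /scratch/$USER   # free space on scratch
quota -s               # home-directory quota, if your site uses one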
Checking Your HPC Environment
Many HPC clusters do not have internet access on compute nodes (the nodes where your heavy jobs run). They often only have internet on “login” or “head” nodes.
- Downloads: Always run installation and download commands on a login node.
- Execution: When running jobs, ensure your tools don’t try to download models on the fly. Pre-download all weights and databases.
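For example, a login-node pre-download step might look like the following. The container image is a real NGC image; the weights URL is a placeholder, not a real endpoint:
# Run on the LOGIN node before submitting any jobs
apptainer pull docker://nvcr.io/nvidia/pytorch:23.10-py3
wget -P /scratch/$USER/weights https://example.org/model_weights.pt  # placeholder URL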
1. Check Available CUDA Modules
module avail cuda
This shows all CUDA versions installed on your cluster. Note the versions - you’ll need to match them to tool requirements.
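If your cluster uses Lmod, module spider searches more thoroughly than module avail and also reports modules hidden behind compiler or MPI hierarchies:
module spider cuda        # list every CUDA version Lmod knows about
module spider cuda/12.1   # show what must be loaded before this version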
2. Check GPU Availability
Request an interactive GPU session:
# SLURM example
srun --gpus=1 --pty bash
Then check GPU status:
nvidia-smi
This shows:
- GPU model (A100, V100, RTX 4090, etc.)
- GPU memory (important for large models)
- Current CUDA driver version
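For scripting, nvidia-smi can print just the fields you need; these are standard query flags:
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv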
3. Check CUDA Toolkit Version
nvcc --version
If this fails, load a CUDA module first:
module load cuda/12.1
nvcc --version
Conda/Mamba Setup
Most tools use Conda environments. Mamba is recommended as it’s significantly faster than Conda for dependency resolution.
Installing Mamba (if not available)
If your HPC doesn’t have Mamba, install Miniforge:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
Follow the prompts, then restart your shell or run:
source ~/.bashrc
Best Practices for HPC Conda Usage
- Use dedicated environment directories: Set environment location to avoid filling home directory quota:
# Add to ~/.condarc
envs_dirs:
  - /scratch/$USER/conda_envs
pkgs_dirs:
  - /scratch/$USER/conda_pkgs
- One environment per tool: Don’t try to install all tools in one environment - dependency conflicts are common (see the example after this list).
- Export environments for reproducibility:
mamba env export > environment.yml
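For example, creating one environment per tool and later recreating an exported one. The tool names and Python versions here are illustrative; check each tool’s docs for its actual requirements:
mamba create -n colabfold python=3.10 -y      # one environment per tool
mamba create -n rfdiffusion python=3.10 -y
mamba env create -f environment.yml           # recreate an exported environment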
Docker vs Singularity/Apptainer
IMPORTANT: Most academic HPCs do NOT support Docker for security reasons. Use Singularity or Apptainer instead.
Loading Container Runtime
module load apptainer
# or on older systems:
module load singularity
Converting Docker Commands to Apptainer
Many tool READMEs show Docker commands. Here’s how to translate them:
| Docker Command | Apptainer Equivalent |
|---|---|
| docker run | apptainer run |
| docker run --gpus all | apptainer run --nv |
| docker run -v /path:/path | apptainer run --bind /path:/path |
| docker pull image:tag | apptainer pull docker://image:tag |
Example conversion:
# Docker (won't work on HPC):
docker run --gpus all -v $(pwd):/workspace myimage:latest python script.py
# Apptainer (works on HPC):
apptainer run --nv --bind $(pwd):/workspace myimage.sif python script.py
Pulling Docker Images as Singularity Files
apptainer pull docker://nvcr.io/nvidia/pytorch:23.10-py3
# Creates: pytorch_23.10-py3.sif
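Once pulled, the .sif file can be run directly. A quick GPU smoke test inside the container:
apptainer exec --nv pytorch_23.10-py3.sif python -c "import torch; print(torch.cuda.is_available())"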
SLURM Job Submission Basics
Most HPCs use SLURM for job scheduling. Here’s a template for ML jobs:
#!/bin/bash
#SBATCH --job-name=my_ml_job
#SBATCH --partition=gpu # GPU partition name (varies by cluster)
#SBATCH --gpus=1 # Number of GPUs
#SBATCH --cpus-per-task=8 # CPUs for data loading
#SBATCH --mem=64G # RAM
#SBATCH --time=04:00:00 # Wall time (HH:MM:SS)
#SBATCH --output=%x_%j.out # Output file (%x=job name, %j=job ID)
#SBATCH --error=%x_%j.err # Error file
# Load required modules
module load cuda/12.1
module load apptainer
# Activate conda environment
# source ~/.bashrc # Source your shell profile if needed
source /path/to/your/miniforge3/etc/profile.d/conda.sh # Better: source conda.sh directly
mamba activate my_env
# Run your command
python my_script.py
Common SLURM Commands
| Command | Description |
|---|---|
| sbatch script.sh | Submit job |
| squeue -u $USER | Check your jobs |
| scancel JOB_ID | Cancel a job |
| sinfo | Show partition info |
| sacct -j JOB_ID | Job accounting info |
GPU Partition Names
GPU partition names vary by cluster. Common names:
- gpu, gpus, gpu-shared
- a100, v100, rtx
- gpu-debug (for testing)
Check your cluster’s documentation or run sinfo to see available partitions.
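A format string narrows sinfo down to what matters for GPU jobs; %P, %l, and %G are standard sinfo format specifiers for partition, time limit, and generic resources (GRES):
sinfo -o "%P %l %G"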
Environment Variables
Several tools use environment variables. Add these to your ~/.bashrc:
# Model weight storage (prevents filling home directory)
export TORCH_HOME=/scratch/$USER/torch_cache
export HF_HOME=/scratch/$USER/huggingface_cache
export TRANSFORMERS_CACHE=/scratch/$USER/transformers_cache
# ColabFold databases
export COLABFOLD_DOWNLOAD_DIR=/scratch/$USER/colabfold_db
# Chai-1 models
export CHAI_DOWNLOADS_DIR=/scratch/$USER/chai_models
# General cache
export XDG_CACHE_HOME=/scratch/$USER/.cache
Replace /scratch/$USER with your cluster’s scratch or work directory path.
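Create these directories up front; some tools fail confusingly on first run if a cache path doesn’t exist:
mkdir -p /scratch/$USER/{torch_cache,huggingface_cache,transformers_cache,colabfold_db,chai_models,.cache}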
Verifying GPU Works with PyTorch
After setting up an environment with PyTorch, verify GPU access:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    # Quick computation test
    x = torch.randn(1000, 1000, device='cuda')
    y = torch.matmul(x, x)
    print("GPU computation test: PASSED")
Save as test_gpu.py and run:
python test_gpu.py
Expected output (example):
PyTorch version: 2.1.0
CUDA available: True
CUDA version: 12.1
GPU count: 1
GPU name: NVIDIA A100-SXM4-80GB
GPU memory: 84.9 GB
GPU computation test: PASSED
Understanding GPU Memory Requirements
Different tasks require different GPU memory:
| Task | Typical GPU Memory |
|---|---|
| Structure prediction (small protein <200 aa) | 8-16 GB |
| Structure prediction (large protein >500 aa) | 32-80 GB |
| Protein design (RFdiffusion2) | 16-32 GB |
| Docking (DiffDock-PP, PLACER) | 8-16 GB |
| Language model inference (ESM3) | 16-40 GB |
| Binder design (BindCraft) | 32-80 GB |
If you get out-of-memory errors:
1. Request a GPU with more memory
2. Reduce batch size or sequence length
3. Use CPU offloading if available
4. Process sequences in chunks (see the sketch below)
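A minimal sketch of the chunking idea, assuming a PyTorch model whose forward pass takes a batch tensor. run_in_chunks, model, and inputs are placeholder names, not part of any specific tool:
import torch

def run_in_chunks(model, inputs, chunk_size=8):
    """Run inference in chunks, halving the chunk size on CUDA OOM."""
    outputs = []
    i = 0
    while i < len(inputs):
        try:
            with torch.no_grad():
                batch = inputs[i:i + chunk_size].to('cuda')
                outputs.append(model(batch).cpu())
            i += chunk_size
        except torch.cuda.OutOfMemoryError:
            if chunk_size == 1:
                raise  # a single item still doesn't fit; need a bigger GPU
            torch.cuda.empty_cache()
            chunk_size //= 2  # retry the same chunk at half the size
    return torch.cat(outputs)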
Troubleshooting Common Issues
“CUDA out of memory”
- Request more GPU memory
- Reduce batch size
- Use gradient checkpointing if training
“No CUDA runtime found”
module load cuda/12.1 # Load CUDA module
nvcc --version # Verify it loaded
“Singularity: command not found”
module load apptainer # or: module load singularity
Conda environment activation fails in SLURM
Add to your job script:
# Source conda.sh directly (adjust path to your installation)
source /path/to/miniforge3/etc/profile.d/conda.sh
mamba activate my_env
Permission denied on container
chmod +x container.sif