6. OpenFold (Optional)

OpenFold (paper, code) is a faithful, trainable PyTorch reproduction of DeepMind’s AlphaFold2. It achieves performance comparable to AlphaFold2 and provides a fully open-source implementation for protein structure prediction.

Why Use OpenFold?

  • Full transparency: Open-source model architecture and training code
  • Trainable: Can be fine-tuned or retrained on custom data
  • Research-friendly: Ideal for understanding how structure prediction works
  • MSA-based accuracy: Uses evolutionary information for high-accuracy predictions

Related Tools: For faster predictions without MSAs, see ESMFold. For a more user-friendly MSA-based option, see LocalColabFold.

Resource Requirements

| Resource   | Minimum | Recommended | Notes                              |
|------------|---------|-------------|------------------------------------|
| GPU RAM    | 16 GB   | 40+ GB      | A100 for large proteins            |
| CPU RAM    | 32 GB   | 64+ GB      | MSA generation is memory-intensive |
| Disk space | 500 GB  | 2+ TB       | Sequence databases are large       |
| CUDA       | 11.3+   | 12.1+       | Required for compilation           |

Note: OpenFold requires significant disk space for sequence databases if generating MSAs locally. Check if your HPC already has AlphaFold/OpenFold databases installed.

Preparation

Prerequisites:

  • Completed HPC Setup guide
  • Conda/Mamba installed
  • nvcc available for CUDA compilation
  • Significant disk space (or access to shared databases)

Check for existing databases:

# Ask your HPC admins or check common locations
ls /shared/databases/alphafold/
ls /shared/databases/openfold/

Many HPCs have pre-installed AlphaFold databases that OpenFold can use.
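
Before downloading anything, it can help to script the check above. The following sketch looks for the subdirectory names used by the standard AlphaFold download scripts (`bfd`, `mgnify`, `pdb70`, `pdb_mmcif`, `uniref90`); your HPC may organize its copy differently, so treat the names and the `/shared/databases/alphafold` path as assumptions to adapt.

```python
# Sketch: check a candidate directory for the usual AlphaFold database
# layout. Subdirectory names follow the standard download scripts and are
# an assumption -- adjust them to match your site's installation.
from pathlib import Path

EXPECTED_DBS = ["bfd", "mgnify", "pdb70", "pdb_mmcif", "uniref90"]

def check_database_dir(base):
    """Return (found, missing) lists of expected database subdirectories."""
    base = Path(base)
    found = [d for d in EXPECTED_DBS if (base / d).is_dir()]
    missing = [d for d in EXPECTED_DBS if d not in found]
    return found, missing

found, missing = check_database_dir("/shared/databases/alphafold")
print("found:", found, "| missing:", missing)
```

If everything comes back missing, ask your admins before starting the ~2 TB download.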

Installation

Important: OpenFold installation can be complex. The official documentation at openfold.readthedocs.io has the most current instructions.

  1. Clone the repository:
git clone https://github.com/aqlaboratory/openfold.git
cd openfold
  2. Create the conda environment:
mamba env create -f environment.yml
mamba activate openfold_venv

Expected time: 10-20 minutes for environment creation.

  3. Install OpenFold:
pip install -e .
  4. Download model weights:
bash scripts/download_openfold_params.sh openfold/resources

Expected download: ~1-2 GB of model weights.

  5. (Optional) Download sequence databases for MSA generation:
# This downloads ~2TB of data - skip if using HPC shared databases
bash scripts/download_alphafold_dbs.sh /path/to/database/directory

Testing the Installation

Create a test FASTA file test.fasta:

>test_protein
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
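
A malformed FASTA file is a common cause of cryptic failures, so it is worth validating the input before queuing a job. This is a small, hypothetical helper (not part of OpenFold) that parses a FASTA file and rejects empty records or non-amino-acid characters:

```python
# Sketch: sanity-check a FASTA file before submitting a prediction job.
# Catches empty sequences and invalid residue characters early. This helper
# is illustrative, not part of the OpenFold codebase.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def validate_fasta(path):
    """Return a list of (header, sequence) pairs, raising on malformed input."""
    records, header, seq = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line[1:], []
            else:
                bad = set(line.upper()) - VALID_AA
                if bad:
                    raise ValueError(f"invalid residues {bad} in record {header!r}")
                seq.append(line.upper())
    if header is not None:
        records.append((header, "".join(seq)))
    if not records or any(not s for _, s in records):
        raise ValueError("FASTA contains no records or an empty sequence")
    return records
```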

Run a prediction (using pre-computed MSAs or without MSAs for testing):

python run_pretrained_openfold.py \
    test.fasta \
    /path/to/database/directory \
    --output_dir predictions/ \
    --config_preset model_1_ptm \
    --model_device cuda:0

Note: For testing without databases, you can use --use_precomputed_alignments with a directory containing pre-computed MSA files.

Success indicators:

  • Command completes without errors
  • predictions/ directory contains PDB files
  • Output includes confidence metrics (pLDDT, pTM)

Expected runtime: 5-30 minutes depending on MSA availability and protein size.
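
The success indicators above can be checked with a short script once the run finishes. Exact output file names depend on your OpenFold version and preset, so the glob below is deliberately loose:

```python
# Sketch: post-run check that the prediction wrote output. File naming
# varies by OpenFold version, so this just lists any PDB files found.
from pathlib import Path

def list_predictions(output_dir):
    """Return sorted paths of PDB files in the output directory."""
    return sorted(Path(output_dir).glob("*.pdb"))

pdbs = list_predictions("predictions/")
if not pdbs:
    print("No PDB files found -- check the job log for errors.")
for p in pdbs:
    print(p.name, f"{p.stat().st_size / 1024:.1f} KiB")
```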

HPC Job Script

#!/bin/bash
#SBATCH --job-name=openfold
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.1

# source ~/.bashrc # Source shell profile if needed
mamba activate openfold_venv

cd /path/to/openfold

# Using HPC shared databases
DATABASE_DIR=/shared/databases/alphafold

python run_pretrained_openfold.py \
    my_protein.fasta \
    $DATABASE_DIR \
    --output_dir predictions/ \
    --config_preset model_1_ptm \
    --model_device cuda:0

Usage Examples

Basic prediction with local databases:

python run_pretrained_openfold.py \
    input.fasta \
    /path/to/databases \
    --output_dir output/ \
    --config_preset model_1_ptm

Using pre-computed MSAs:

python run_pretrained_openfold.py \
    input.fasta \
    /path/to/databases \
    --use_precomputed_alignments /path/to/msas/ \
    --output_dir output/
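
In my experience, `--use_precomputed_alignments` expects one subdirectory per FASTA record (named after the sequence header) containing the alignment files (e.g. `.a3m`, `.sto`, `.hhr`); the exact expectations may vary by OpenFold version, so verify against the official docs. A quick way to preview what OpenFold will see:

```python
# Sketch: preview a precomputed-alignments directory. Assumed layout:
# one subdirectory per sequence tag, each holding alignment files.
from pathlib import Path

def preview_alignments(msa_dir):
    """Map each sequence subdirectory to its alignment file names."""
    layout = {}
    for sub in sorted(Path(msa_dir).iterdir()):
        if sub.is_dir():
            layout[sub.name] = sorted(p.name for p in sub.iterdir())
    return layout
```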

Multiple model presets (ensemble):

for preset in model_1_ptm model_2_ptm model_3_ptm; do
    python run_pretrained_openfold.py \
        input.fasta \
        /path/to/databases \
        --config_preset $preset \
        --output_dir output_${preset}/
done
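
After an ensemble run like the loop above, you typically keep the model with the highest mean pLDDT. AlphaFold-style PDB output stores per-residue pLDDT in the B-factor column (columns 61-66 of each ATOM record), so it can be read without extra dependencies; the directory names below match the shell loop:

```python
# Sketch: rank ensemble outputs by mean pLDDT, read from the PDB B-factor
# column where AlphaFold-style models store per-residue confidence.
from pathlib import Path

def mean_plddt(pdb_path):
    """Mean pLDDT over CA atoms, read from the PDB B-factor column."""
    vals = [float(line[60:66])
            for line in open(pdb_path)
            if line.startswith("ATOM") and line[12:16].strip() == "CA"]
    return sum(vals) / len(vals)

for preset in ["model_1_ptm", "model_2_ptm", "model_3_ptm"]:
    out_dir = Path(f"output_{preset}")
    if out_dir.is_dir():
        for pdb in sorted(out_dir.glob("*.pdb")):
            print(f"{preset} {pdb.name}: mean pLDDT {mean_plddt(pdb):.1f}")
```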

Model Presets

| Preset              | Description                  |
|---------------------|------------------------------|
| model_1_ptm         | Standard model with pTM head |
| model_2_ptm         | Alternative model with pTM   |
| model_3_ptm         | Third model variant          |
| model_1_multimer_v3 | For protein complexes        |

Understanding the Output

Output directory structure:

predictions/
├── test_protein_model_1_ptm_unrelaxed.pdb    # Predicted structure
├── test_protein_model_1_ptm_confidences.json # Confidence scores
└── test_protein_model_1_ptm_timings.json     # Runtime statistics

Confidence metrics:

  • pLDDT: Per-residue confidence (0-100, higher is better)
  • pTM: Predicted TM-score (0-1, >0.8 is confident)
  • PAE: Predicted Aligned Error matrix
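
AlphaFold-style work commonly buckets pLDDT into four bands (very high >90, confident 70-90, low 50-70, very low <50). A small sketch for summarizing a prediction this way (the input values are illustrative; in practice they would come from the confidences JSON or the PDB B-factor column):

```python
# Sketch: bucket per-residue pLDDT values into the standard AlphaFold
# confidence bands for a quick quality summary.
def plddt_bands(plddts):
    """Count residues per AlphaFold confidence band."""
    bands = {"very high (>90)": 0, "confident (70-90)": 0,
             "low (50-70)": 0, "very low (<50)": 0}
    for v in plddts:
        if v > 90:
            bands["very high (>90)"] += 1
        elif v > 70:
            bands["confident (70-90)"] += 1
        elif v > 50:
            bands["low (50-70)"] += 1
        else:
            bands["very low (<50)"] += 1
    return bands

print(plddt_bands([96.2, 88.5, 71.3, 42.0]))  # illustrative values
```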

Database Requirements

If generating MSAs locally, OpenFold needs these databases:

| Database | Size    | Purpose                |
|----------|---------|------------------------|
| BFD      | ~1.7 TB | HHblits MSA search     |
| MGnify   | ~120 GB | Metagenomic MSA search |
| UniRef90 | ~100 GB | JackHMMER MSA search   |
| UniRef30 | ~200 GB | HHblits MSA search     |
| PDB70    | ~60 GB  | Structure templates    |

Total: ~2+ TB

Check HPC shared databases first - most research HPCs have these pre-installed.

Troubleshooting

Warning: Common Issues

Compilation errors during install:

# Ensure CUDA toolkit is loaded
module load cuda/12.1
nvcc --version

# Clean and retry
pip uninstall -y openfold
pip install -e .

“Database not found” errors:

  • Verify database paths exist
  • Check HPC documentation for shared database locations
  • Contact HPC admins about AlphaFold database availability

Out of memory:

  • Request more GPU memory
  • Reduce --max_recycling_iters
  • Use gradient checkpointing for training

Slow MSA generation:

  • MSA generation is CPU-bound and can take hours
  • Use pre-computed MSAs when possible
  • Consider using ColabFold’s MMseqs2 server instead

Model weights not found:

# Re-download weights
bash scripts/download_openfold_params.sh openfold/resources

# Verify files exist
ls openfold/resources/*.pt

For Researchers: Training

OpenFold can be retrained or fine-tuned:

python train_openfold.py \
    /path/to/training/data \
    /path/to/template_mmcif \
    /path/to/output \
    --config_preset initial_training

See the training documentation for details.