6. OpenFold (Optional)
OpenFold (paper, code) is a faithful, trainable PyTorch reproduction of DeepMind’s AlphaFold2. It achieves performance comparable to AlphaFold2 and provides a fully open-source implementation for protein structure prediction.
Why Use OpenFold?
- Full transparency: Open-source model architecture and training code
- Trainable: Can be fine-tuned or retrained on custom data
- Research-friendly: Ideal for understanding how structure prediction works
- MSA-based accuracy: Uses evolutionary information for high-accuracy predictions
Related Tools: For faster predictions without MSAs, see ESMFold. For a more user-friendly MSA-based option, see LocalColabFold.
Resource Requirements
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU RAM | 16 GB | 40+ GB | A100 for large proteins |
| CPU RAM | 32 GB | 64+ GB | MSA generation is memory-intensive |
| Disk Space | 500 GB | 2+ TB | Sequence databases are large |
| CUDA | 11.3+ | 12.1+ | Required for compilation |
Note: OpenFold requires significant disk space for sequence databases if generating MSAs locally. Check if your HPC already has AlphaFold/OpenFold databases installed.
Preparation
Prerequisites:
- Completed HPC Setup guide
- Conda/Mamba installed
- nvcc available for CUDA compilation
- Significant disk space (or access to shared databases)
Check for existing databases:
# Ask your HPC admins or check common locations
ls /shared/databases/alphafold/
ls /shared/databases/openfold/
Many HPCs have pre-installed AlphaFold databases that OpenFold can use.
Installation
Important: OpenFold installation can be complex. The official documentation at openfold.readthedocs.io has the most current instructions.
- Clone the repository:
git clone https://github.com/aqlaboratory/openfold.git
cd openfold
- Create the conda environment:
mamba env create -f environment.yml
mamba activate openfold_venv
Expected time: 10-20 minutes for environment creation.
- Install OpenFold:
pip install -e .
- Download model weights:
bash scripts/download_openfold_params.sh openfold/resources
Expected download: ~1-2 GB of model weights.
- (Optional) Download sequence databases for MSA generation:
# This downloads ~2TB of data - skip if using HPC shared databases
bash scripts/download_alphafold_dbs.sh /path/to/database/directory
Testing the Installation
Create a test FASTA file test.fasta:
>test_protein
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
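Before queuing a job, it can help to sanity-check the FASTA input. The snippet below is a generic validation sketch, not part of OpenFold: the check_fasta helper and the 20-letter standard amino-acid alphabet are assumptions of this example.

```python
# Generic FASTA sanity check (hypothetical helper, not part of OpenFold):
# verifies that every sequence line uses only the 20 standard amino acids.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def check_fasta(text: str) -> list[str]:
    """Return the sequences in a FASTA string, raising on bad residues."""
    seqs, current = [], []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if current:
                seqs.append("".join(current))
                current = []
        else:
            residues = line.strip().upper()
            bad = set(residues) - VALID_RESIDUES
            if bad:
                raise ValueError(f"invalid residues: {sorted(bad)}")
            current.append(residues)
    if current:
        seqs.append("".join(current))
    return seqs

fasta = """>test_protein
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"""
print(len(check_fasta(fasta)[0]))  # 65
```

Catching a stray character here is much cheaper than discovering it after a queued GPU job fails.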
Run a prediction (using pre-computed MSAs or without MSAs for testing):
python run_pretrained_openfold.py \
test.fasta \
/path/to/database/directory \
--output_dir predictions/ \
--config_preset model_1_ptm \
--model_device cuda:0
Note: For testing without databases, you can use --use_precomputed_alignments with a directory containing pre-computed MSA files.
Success indicators:
- Command completes without errors
- predictions/ directory contains PDB files
- Output includes confidence metrics (pLDDT, pTM)
Expected runtime: 5-30 minutes depending on MSA availability and protein size.
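The confidence scores written alongside the structure can be summarized programmatically. This is a sketch assuming the JSON holds a per-residue "plddt" list and a scalar "ptm" value; the exact keys can differ between OpenFold versions, so verify against your own output. summarize_confidence is a hypothetical helper.

```python
import json
import os
import tempfile

def summarize_confidence(path: str) -> dict:
    """Summarize a confidence JSON into a few headline numbers."""
    with open(path) as fh:
        data = json.load(fh)
    plddt = data["plddt"]  # assumed key: per-residue confidence list
    return {
        "mean_plddt": sum(plddt) / len(plddt),
        "min_plddt": min(plddt),
        "ptm": data.get("ptm"),  # assumed key: scalar predicted TM-score
        "n_residues": len(plddt),
    }

# Synthetic example (not real OpenFold output), just to show usage:
sample = {"plddt": [90.0, 80.0, 70.0], "ptm": 0.82}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump(sample, fh)
    path = fh.name
summary = summarize_confidence(path)
os.remove(path)
print(summary)
```

A quick mean-pLDDT check like this is an easy way to triage many predictions before opening any structures in a viewer.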
HPC Job Script
#!/bin/bash
#SBATCH --job-name=openfold
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00
#SBATCH --output=%x_%j.out
module load cuda/12.1
# source ~/.bashrc # Source shell profile if needed
mamba activate openfold_venv
cd /path/to/openfold
# Using HPC shared databases
DATABASE_DIR=/shared/databases/alphafold
python run_pretrained_openfold.py \
my_protein.fasta \
$DATABASE_DIR \
--output_dir predictions/ \
--config_preset model_1_ptm \
--model_device cuda:0
Usage Examples
Basic prediction with local databases:
python run_pretrained_openfold.py \
input.fasta \
/path/to/databases \
--output_dir output/ \
--config_preset model_1_ptm
Using pre-computed MSAs:
python run_pretrained_openfold.py \
input.fasta \
/path/to/databases \
--use_precomputed_alignments /path/to/msas/ \
--output_dir output/
Multiple model presets (ensemble):
for preset in model_1_ptm model_2_ptm model_3_ptm; do
python run_pretrained_openfold.py \
input.fasta \
/path/to/databases \
--config_preset $preset \
--output_dir output_${preset}/
done
Model Presets
| Preset | Description |
|---|---|
| model_1_ptm | Standard model with pTM head |
| model_2_ptm | Alternative model with pTM |
| model_3_ptm | Third model variant |
| model_1_multimer_v3 | For protein complexes |
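After running several presets as in the ensemble loop above, one reasonable heuristic is to keep the model with the highest pTM. The sketch below works on an in-memory mapping of preset name to confidence values; in practice you would first load each output_${preset} confidence JSON. best_by_ptm is a hypothetical helper, not an OpenFold function.

```python
# Hypothetical helper: pick the preset with the highest pTM score.
# Build the `results` mapping from the confidence JSON of each run.
def best_by_ptm(results: dict[str, dict]) -> str:
    """Return the preset name whose confidence dict has the highest pTM."""
    return max(results, key=lambda preset: results[preset]["ptm"])

# Illustrative pTM values (not real output):
results = {
    "model_1_ptm": {"ptm": 0.81},
    "model_2_ptm": {"ptm": 0.87},
    "model_3_ptm": {"ptm": 0.79},
}
print(best_by_ptm(results))  # model_2_ptm
```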
Understanding the Output
Output directory structure:
predictions/
├── test_protein_model_1_ptm_unrelaxed.pdb # Predicted structure
├── test_protein_model_1_ptm_confidences.json # Confidence scores
└── test_protein_model_1_ptm_timings.json # Runtime statistics
Confidence metrics:
- pLDDT: Per-residue confidence (0-100, higher is better)
- pTM: Predicted TM-score (0-1, >0.8 is confident)
- PAE: Predicted Aligned Error matrix
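Per-residue pLDDT values are commonly interpreted with the AlphaFold confidence bands (≥90 very high, 70-90 confident, 50-70 low, <50 very low). The small helper below, plddt_band, is illustrative only and not part of OpenFold.

```python
def plddt_band(score: float) -> str:
    """Map a per-residue pLDDT score to the usual confidence bands."""
    if score >= 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"

print([plddt_band(s) for s in (95.0, 75.0, 55.0, 30.0)])
```

Regions in the "low" and "very low" bands are often flexible or disordered rather than simply mispredicted, so interpret them accordingly.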
Database Requirements
If generating MSAs locally, OpenFold needs these databases:
| Database | Size | Purpose |
|---|---|---|
| BFD | ~1.7 TB | Sequence alignments |
| MGnify | ~120 GB | Metagenomic sequences |
| UniRef90 | ~100 GB | Sequence clustering |
| UniRef30 | ~200 GB | HHblits searches |
| PDB70 | ~60 GB | Structure templates |
Total: ~2+ TB
Check HPC shared databases first - most research HPCs have these pre-installed.
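Before launching local MSA generation, a quick existence check on the database directories can save a failed job. The subdirectory names below are guesses based on the table above; the real layout depends on how the download was organized on your HPC, so adjust them to your installation.

```python
import os
import tempfile
from pathlib import Path

# Assumed subdirectory names (based on the database table above);
# the actual layout depends on your HPC's download organization.
REQUIRED = ["bfd", "mgnify", "uniref90", "uniref30", "pdb70"]

def missing_databases(root: str) -> list[str]:
    """Return the required database subdirectories absent under root."""
    base = Path(root)
    return [name for name in REQUIRED if not (base / name).is_dir()]

# Demo against a throwaway directory with only two databases present:
with tempfile.TemporaryDirectory() as root:
    for name in ("bfd", "uniref90"):
        os.makedirs(os.path.join(root, name))
    missing = missing_databases(root)
print(sorted(missing))  # ['mgnify', 'pdb70', 'uniref30']
```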
Troubleshooting
Compilation errors during install:
# Ensure CUDA toolkit is loaded
module load cuda/12.1
nvcc --version
# Clean and retry
pip uninstall openfold
pip install -e .
“Database not found” errors:
- Verify database paths exist
- Check HPC documentation for shared database locations
- Contact HPC admins about AlphaFold database availability
Out of memory:
- Request more GPU memory
- Reduce --max_recycling_iters
- Use gradient checkpointing for training
Slow MSA generation:
- MSA generation is CPU-bound and can take hours
- Use pre-computed MSAs when possible
- Consider using ColabFold’s MMseqs2 server instead
Model weights not found:
# Re-download weights
bash scripts/download_openfold_params.sh openfold/resources
# Verify files exist
ls openfold/resources/*.pt
For Researchers: Training
OpenFold can be retrained or fine-tuned:
python train_openfold.py \
/path/to/training/data \
/path/to/template_mmcif \
/path/to/output \
--config_preset initial_training
See the training documentation for details.