6. OpenFold (Optional)
OpenFold (paper, code) is a faithful, trainable PyTorch reproduction of DeepMind’s AlphaFold2. It achieves performance comparable to AlphaFold2 and provides a fully open-source implementation for protein structure prediction.
Why Use OpenFold?
- Full transparency: Open-source model architecture and training code
- Trainable: Can be fine-tuned or retrained on custom data
- Research-friendly: Ideal for understanding how structure prediction works
- MSA-based accuracy: Uses evolutionary information for high-accuracy predictions
Related Tools: For faster predictions without MSAs, see ESMFold. For a more user-friendly MSA-based option, see LocalColabFold.
Resource Requirements
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU RAM | 16 GB | 40+ GB | A100 for large proteins |
| CPU RAM | 32 GB | 64+ GB | MSA generation is memory-intensive |
| Disk Space | 500 GB | 2+ TB | Sequence databases are large |
| CUDA | 11.3+ | 12.1+ | Required for compilation |
Note: OpenFold requires significant disk space for sequence databases if generating MSAs locally. Check if your HPC already has AlphaFold/OpenFold databases installed.
Preparation
Prerequisites:
- Completed HPC Setup guide
- Conda/Mamba installed
- nvcc available for CUDA compilation
- Significant disk space (or access to shared databases)
Check for existing databases:
# Ask your HPC admins or check common locations
ls /shared/databases/alphafold/
ls /shared/databases/openfold/
Many HPCs have pre-installed AlphaFold databases that OpenFold can use.
Installation
Important: OpenFold installation can be complex. The official documentation at openfold.readthedocs.io has the most current instructions.
- Clone the repository:
git clone https://github.com/aqlaboratory/openfold.git
cd openfold
- Create the conda environment:
mamba env create -f environment.yml
mamba activate openfold_venv
Expected time: 10-20 minutes for environment creation.
- Install OpenFold:
pip install -e .
- Download model weights:
bash scripts/download_openfold_params.sh openfold/resources
Expected download: ~1-2 GB of model weights.
- (Optional) Download sequence databases for MSA generation:
# This downloads ~2TB of data - skip if using HPC shared databases
bash scripts/download_alphafold_dbs.sh /path/to/database/directory
Testing the Installation
Create a test FASTA file test.fasta:
>test_protein
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
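Before queuing a job, it can help to sanity-check the FASTA input. The snippet below is a generic validation sketch, not part of OpenFold: the check_fasta helper and the 20-letter standard amino-acid alphabet are assumptions of this example.

```python
# Generic FASTA sanity check (hypothetical helper, not part of OpenFold):
# verifies that every sequence line uses only the 20 standard amino acids.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def check_fasta(text: str) -> list[str]:
    """Return the sequences in a FASTA string, raising on bad residues."""
    seqs, current = [], []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if current:
                seqs.append("".join(current))
                current = []
        else:
            residues = line.strip().upper()
            bad = set(residues) - VALID_RESIDUES
            if bad:
                raise ValueError(f"invalid residues: {sorted(bad)}")
            current.append(residues)
    if current:
        seqs.append("".join(current))
    return seqs

fasta = """>test_protein
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"""
print(len(check_fasta(fasta)[0]))  # 65
```

Catching a stray character here is much cheaper than discovering it after a queued GPU job fails.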
Run a prediction (using pre-computed MSAs or without MSAs for testing):
python run_pretrained_openfold.py \
test.fasta \
/path/to/database/directory \
--output_dir predictions/ \
--config_preset model_1_ptm \
--model_device cuda:0
Note: For testing without databases, you can use --use_precomputed_alignments with a directory containing pre-computed MSA files.
Success indicators:
- Command completes without errors
- predictions/ directory contains PDB files
- Output includes confidence metrics (pLDDT, pTM)
Expected runtime: 5-30 minutes depending on MSA availability and protein size.
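The confidence scores written alongside the structure can be summarized programmatically. This is a sketch assuming the JSON holds a per-residue "plddt" list and a scalar "ptm" value; the exact keys can differ between OpenFold versions, so verify against your own output. summarize_confidence is a hypothetical helper.

```python
import json
import os
import tempfile

def summarize_confidence(path: str) -> dict:
    """Summarize a confidence JSON into a few headline numbers."""
    with open(path) as fh:
        data = json.load(fh)
    plddt = data["plddt"]  # assumed key: per-residue confidence list
    return {
        "mean_plddt": sum(plddt) / len(plddt),
        "min_plddt": min(plddt),
        "ptm": data.get("ptm"),  # assumed key: scalar predicted TM-score
        "n_residues": len(plddt),
    }

# Synthetic example (not real OpenFold output), just to show usage:
sample = {"plddt": [90.0, 80.0, 70.0], "ptm": 0.82}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump(sample, fh)
    path = fh.name
summary = summarize_confidence(path)
os.remove(path)
print(summary)
```

A quick mean-pLDDT check like this is an easy way to triage many predictions before opening any structures in a viewer.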
HPC Job Script
#!/bin/bash
#SBATCH --job-name=openfold
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00
#SBATCH --output=%x_%j.out
module load cuda/12.1
# source ~/.bashrc # Source shell profile if needed
mamba activate openfold_venv
cd /path/to/openfold
# Using HPC shared databases
DATABASE_DIR=/shared/databases/alphafold
python run_pretrained_openfold.py \
my_protein.fasta \
$DATABASE_DIR \
--output_dir predictions/ \
--config_preset model_1_ptm \
--model_device cuda:0
Usage Examples
Basic prediction with local databases:
python run_pretrained_openfold.py \
input.fasta \
/path/to/databases \
--output_dir output/ \
--config_preset model_1_ptm
Using pre-computed MSAs:
python run_pretrained_openfold.py \
input.fasta \
/path/to/databases \
--use_precomputed_alignments /path/to/msas/ \
--output_dir output/
Multiple model presets (ensemble):
for preset in model_1_ptm model_2_ptm model_3_ptm; do
python run_pretrained_openfold.py \
input.fasta \
/path/to/databases \
--config_preset $preset \
--output_dir output_${preset}/
done
Model Presets
| Preset | Description |
|---|---|
| model_1_ptm | Standard model with pTM head |
| model_2_ptm | Alternative model with pTM |
| model_3_ptm | Third model variant |
| model_1_multimer_v3 | For protein complexes |
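After running several presets as in the ensemble loop above, one reasonable heuristic is to keep the model with the highest pTM. The sketch below works on an in-memory mapping of preset name to confidence values; in practice you would first load each output_${preset} confidence JSON. best_by_ptm is a hypothetical helper, not an OpenFold function.

```python
# Hypothetical helper: pick the preset with the highest pTM score.
# Build the `results` mapping from the confidence JSON of each run.
def best_by_ptm(results: dict[str, dict]) -> str:
    """Return the preset name whose confidence dict has the highest pTM."""
    return max(results, key=lambda preset: results[preset]["ptm"])

# Illustrative pTM values (not real output):
results = {
    "model_1_ptm": {"ptm": 0.81},
    "model_2_ptm": {"ptm": 0.87},
    "model_3_ptm": {"ptm": 0.79},
}
print(best_by_ptm(results))  # model_2_ptm
```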
Understanding the Output
Output directory structure:
predictions/
├── test_protein_model_1_ptm_unrelaxed.pdb # Predicted structure
├── test_protein_model_1_ptm_confidences.json # Confidence scores
└── test_protein_model_1_ptm_timings.json # Runtime statistics
Confidence metrics:
- pLDDT: Per-residue confidence (0-100, higher is better)
- pTM: Predicted TM-score (0-1, >0.8 is confident)
- PAE: Predicted Aligned Error matrix
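Per-residue pLDDT values are commonly interpreted with the AlphaFold confidence bands (≥90 very high, 70-90 confident, 50-70 low, <50 very low). The small helper below, plddt_band, is illustrative only and not part of OpenFold.

```python
def plddt_band(score: float) -> str:
    """Map a per-residue pLDDT score to the usual confidence bands."""
    if score >= 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"

print([plddt_band(s) for s in (95.0, 75.0, 55.0, 30.0)])
```

Regions in the "low" and "very low" bands are often flexible or disordered rather than simply mispredicted, so interpret them accordingly.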
Database Requirements
If generating MSAs locally, OpenFold needs these databases:
| Database | Size | Purpose |
|---|---|---|
| BFD | ~1.7 TB | Sequence alignments |
| MGnify | ~120 GB | Metagenomic sequences |
| UniRef90 | ~100 GB | Sequence clustering |
| UniRef30 | ~200 GB | HHblits searches |
| PDB70 | ~60 GB | Structure templates |
Total: ~2+ TB
Check HPC shared databases first - most research HPCs have these pre-installed.
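Before launching local MSA generation, a quick existence check on the database directories can save a failed job. The subdirectory names below are guesses based on the table above; the real layout depends on how the download was organized on your HPC, so adjust them to your installation.

```python
import os
import tempfile
from pathlib import Path

# Assumed subdirectory names (based on the database table above);
# the actual layout depends on your HPC's download organization.
REQUIRED = ["bfd", "mgnify", "uniref90", "uniref30", "pdb70"]

def missing_databases(root: str) -> list[str]:
    """Return the required database subdirectories absent under root."""
    base = Path(root)
    return [name for name in REQUIRED if not (base / name).is_dir()]

# Demo against a throwaway directory with only two databases present:
with tempfile.TemporaryDirectory() as root:
    for name in ("bfd", "uniref90"):
        os.makedirs(os.path.join(root, name))
    missing = missing_databases(root)
print(sorted(missing))  # ['mgnify', 'pdb70', 'uniref30']
```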
Troubleshooting
Compilation errors during install:
# Ensure CUDA toolkit is loaded
module load cuda/12.1
nvcc --version
# Clean and retry
pip uninstall openfold
pip install -e .
“Database not found” errors:
- Verify database paths exist
- Check HPC documentation for shared database locations
- Contact HPC admins about AlphaFold database availability
Out of memory:
- Request more GPU memory
- Reduce --max_recycling_iters
- Use gradient checkpointing for training
Slow MSA generation:
- MSA generation is CPU-bound and can take hours
- Use pre-computed MSAs when possible
- Consider using ColabFold’s MMseqs2 server instead
Model weights not found:
# Re-download weights
bash scripts/download_openfold_params.sh openfold/resources
# Verify files exist
ls openfold/resources/*.pt
For Researchers: Training
OpenFold can be retrained or fine-tuned:
python train_openfold.py \
/path/to/training/data \
/path/to/template_mmcif \
/path/to/output \
--config_preset initial_training
See the training documentation for details.