5. ESMFold
ESMFold (paper, code) is an end-to-end single-sequence structure predictor that uses the ESM-2 language model to generate accurate 3D protein structures directly from sequence, without requiring multiple sequence alignments (MSAs).
Why Use ESMFold?
- Speed: Significantly faster than AlphaFold2 (seconds vs minutes)
- No MSA required: Works directly from sequence alone
- Competitive accuracy: Often comparable to AlphaFold2 for well-folded domains
- Lower resource usage: Can run on smaller GPUs
Related Tools: For MSA-based prediction with potentially higher accuracy, see LocalColabFold or OpenFold. For protein language model embeddings only, see ESM3.
Resource Requirements
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU RAM | 16 GB | 40+ GB | Larger proteins need more memory |
| CPU RAM | 16 GB | 32 GB | CPU-only is possible but slow |
| Disk Space | 5 GB | 10 GB | Model weights |
| Python | ≤3.9 | 3.9 | Important: Python 3.10+ may have issues |
Why Python ≤3.9? ESMFold depends on OpenFold, which has compatibility issues with newer Python versions.
Preparation
Prerequisites:
- Completed HPC Setup guide
- Conda/Mamba installed
- `nvcc` available (for compiling OpenFold dependencies)
Verify your environment:
```bash
nvcc --version    # Required for OpenFold compilation
module load cuda  # If nvcc not found
```

Installation
- Create a conda environment with Python 3.9:
```bash
mamba create -n esmfold python=3.9
mamba activate esmfold
```

- Install PyTorch (adjust CUDA version to match your cluster):

```bash
mamba install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
```

- Install ESM with ESMFold dependencies:

```bash
pip install "fair-esm[esmfold]"
```

- Install OpenFold dependencies:

```bash
pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'
```

Note: OpenFold compilation requires nvcc. If it fails, verify the CUDA toolkit is loaded.
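Before moving on, a quick import check can catch a broken OpenFold build early. A minimal sketch (the script name is arbitrary; run it inside the esmfold environment):

```python
# check_imports.py - quick smoke test for the esmfold environment
import torch
import esm       # fair-esm
import openfold  # raises ImportError if the OpenFold install/compilation failed

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```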
Alternative method (using environment file):
```bash
wget https://raw.githubusercontent.com/facebookresearch/esm/main/environment.yml
mamba env create -f environment.yml
mamba activate esmfold
```

Testing the Installation
Create a test script test_esmfold.py:
```python
import torch
import esm

# Load ESMFold model
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()  # Remove .cuda() if using CPU

# Test sequence (65 residues)
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

# Run prediction
with torch.no_grad():
    output = model.infer_pdb(sequence)

# Save output
with open("test_result.pdb", "w") as f:
    f.write(output)

print("Structure prediction successful!")
print("Output saved to test_result.pdb")
print(f"Sequence length: {len(sequence)} residues")
```

Run the test:

```bash
python test_esmfold.py
```

Success indicators:
- Command completes without errors
- `test_result.pdb` file is created
- File contains valid PDB coordinates (see the check below)
Expected runtime: ~10-30 seconds on GPU for this small protein.
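To check the last indicator programmatically, here is a small sketch that parses the coordinate columns of `test_result.pdb` (this helper is illustrative, not part of ESM):

```python
# verify_test_output.py - illustrative check that test_result.pdb contains coordinates
n_atoms = 0
with open("test_result.pdb") as f:
    for line in f:
        if line.startswith(("ATOM", "HETATM")):
            # x, y, z live in fixed PDB columns 31-54; float() fails if they are malformed
            x, y, z = float(line[30:38]), float(line[38:46]), float(line[46:54])
            n_atoms += 1

print(f"Parsed {n_atoms} atom records with valid coordinates")
```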
HPC Job Script
```bash
#!/bin/bash
#SBATCH --job-name=esmfold
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.1
# source ~/.bashrc
mamba activate esmfold

# Predict structures for all sequences in FASTA file
esm-fold -i my_proteins.fasta -o predictions/ \
    --num-recycles 4 \
    --max-tokens-per-batch 1024
```

Usage Examples
Command line interface:
```bash
esm-fold -i sequences.fasta -o output_pdbs/
```

Key CLI options:
| Option | Description |
|---|---|
| `-i` | Input FASTA file |
| `-o` | Output directory for PDB files |
| `--num-recycles` | Number of recycles (default: 4) |
| `--max-tokens-per-batch` | Batch shorter sequences together |
| `--chunk-size` | Reduce memory (values: 128, 64, 32) |
| `--cpu-only` | Run on CPU only |
| `--cpu-offload` | Offload to CPU RAM for long sequences |
Reduce memory for large proteins:
```bash
esm-fold -i large_proteins.fasta -o output/ --chunk-size 64
```

Process very long sequences:
```bash
esm-fold -i long_sequences.fasta -o output/ --cpu-offload
```

Python API:
```python
import torch
import esm

# Load model
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Predict structure
sequence = "MVKLTAEGSEVSRQVIVQDIAYLRSLG"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

# Save
with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```
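The CLI already handles FASTA input, but the same loop is easy to write against the Python API. A sketch, assuming an input file named my_proteins.fasta (the minimal FASTA reader here is illustrative):

```python
import torch
import esm

def read_fasta(path):
    """Tiny FASTA reader: yields (name, sequence) pairs."""
    name, seq = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            elif line:
                seq.append(line)
        if name is not None:
            yield name, "".join(seq)

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

for name, sequence in read_fasta("my_proteins.fasta"):
    with torch.no_grad():
        pdb_string = model.infer_pdb(sequence)
    with open(f"{name}.pdb", "w") as f:
        f.write(pdb_string)
    print(f"{name}: {len(sequence)} residues -> {name}.pdb")
```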
Get confidence scores:

```python
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequence = "MVKLTAEGSEVSRQVIVQDIAYLRSLG"
with torch.no_grad():
    output = model.infer(sequence)

# Per-residue confidence (pLDDT)
plddt = output["plddt"]  # Shape: (1, L)
print(f"Mean pLDDT: {plddt.mean().item():.2f}")

# Predicted TM-score
ptm = output["ptm"]
print(f"pTM: {ptm.item():.3f}")
```

Understanding the Output
PDB output:
- Standard PDB format with predicted coordinates
- B-factor column contains pLDDT confidence scores (0-100)
- Higher pLDDT = higher confidence
Confidence score interpretation:
| pLDDT Range | Interpretation |
|---|---|
| 90-100 | Very high confidence |
| 70-90 | Confident |
| 50-70 | Low confidence (may be disordered) |
| <50 | Very low confidence (likely disordered) |
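As a worked example of applying these ranges, the sketch below pulls per-residue pLDDT from the B-factor column of CA atoms in an ESMFold PDB and counts residues in each band (the input file name is illustrative):

```python
# plddt_summary.py - bin per-residue pLDDT (read from CA B-factors) into the bands above
from collections import Counter

def band(plddt):
    if plddt >= 90: return "very high (90-100)"
    if plddt >= 70: return "confident (70-90)"
    if plddt >= 50: return "low (50-70)"
    return "very low (<50)"

counts = Counter()
with open("prediction.pdb") as f:
    for line in f:
        # CA atoms give one pLDDT value per residue; B-factor is PDB columns 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            counts[band(float(line[60:66]))] += 1

for name, n in counts.most_common():
    print(f"{name}: {n} residues")
```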
Memory Usage Guide
Approximate GPU memory by sequence length:
| Sequence Length | GPU Memory Needed |
|---|---|
| <200 aa | 8-16 GB |
| 200-400 aa | 16-24 GB |
| 400-600 aa | 24-40 GB |
| 600-1000 aa | 40-80 GB |
| >1000 aa | Use --cpu-offload |
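One practical way to use this table is to split an input FASTA by length, so very long sequences go to a separate --cpu-offload run. A rough sketch, assuming Biopython is available in the environment (file names and the 1000-residue cutoff are placeholders):

```python
# split_by_length.py - route sequences >1000 aa to a separate FASTA for --cpu-offload runs
from Bio import SeqIO

CUTOFF = 1000
short, long_ = [], []

for record in SeqIO.parse("my_proteins.fasta", "fasta"):
    (long_ if len(record.seq) > CUTOFF else short).append(record)

SeqIO.write(short, "short.fasta", "fasta")
SeqIO.write(long_, "long.fasta", "fasta")
print(f"{len(short)} -> short.fasta, {len(long_)} -> long.fasta (run with --cpu-offload)")
```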
Troubleshooting
OpenFold installation fails:
- Verify `nvcc` is available:

```bash
nvcc --version
# If not found: module load cuda
```

- Ensure the PyTorch CUDA version matches the system CUDA version
“CUDA out of memory”:
```bash
# Use chunking to reduce memory
esm-fold -i input.fasta -o output/ --chunk-size 64

# Or use CPU offloading for very long sequences
esm-fold -i input.fasta -o output/ --cpu-offload
```

Slow on GPU (should be fast):
```bash
# Verify CUDA is detected
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True
```

Python version errors:
- ESMFold requires Python ≤3.9 due to OpenFold dependencies
- Create a new environment with Python 3.9 if needed
Model download hangs:
- First run downloads ~2 GB of model weights
- Set a custom cache location:

```bash
export TORCH_HOME=/scratch/$USER/torch_cache
```