5. ESMFold

ESMFold (paper, code) is an end-to-end single-sequence structure predictor that uses the ESM-2 language model to generate accurate 3D protein structures directly from sequence, without requiring multiple sequence alignments (MSAs).

Why Use ESMFold?

  • Speed: Significantly faster than AlphaFold2 (seconds vs minutes)
  • No MSA required: Works directly from sequence alone
  • Competitive accuracy: Often comparable to AlphaFold2 for well-folded domains
  • Lower resource usage: Can run on smaller GPUs

Related Tools: For MSA-based prediction with potentially higher accuracy, see LocalColabFold or OpenFold. For protein language model embeddings only, see ESM3.

Resource Requirements

Resource   | Minimum | Recommended | Notes
GPU RAM    | 16 GB   | 40+ GB      | Larger proteins need more memory
CPU RAM    | 16 GB   | 32 GB       | CPU-only is possible but slow
Disk Space | 5 GB    | 10 GB       | Model weights
Python     | ≤3.9    | 3.9         | Important: Python 3.10+ may have issues

Why Python ≤3.9? ESMFold depends on OpenFold, which has compatibility issues with newer Python versions.

Preparation


Prerequisites:

  • Completed HPC Setup guide
  • Conda/Mamba installed
  • nvcc available (for compiling OpenFold dependencies)

Verify your environment:

nvcc --version      # Required for OpenFold compilation
module load cuda    # If nvcc not found

Installation


  1. Create a conda environment with Python 3.9:

     mamba create -n esmfold python=3.9
     mamba activate esmfold

  2. Install PyTorch (adjust CUDA version to match your cluster):

     mamba install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia

  3. Install ESM with ESMFold dependencies:

     pip install "fair-esm[esmfold]"

  4. Install OpenFold dependencies:

     pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
     pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'

Note: OpenFold compilation requires nvcc. If it fails, verify CUDA toolkit is loaded.

Alternative method (using environment file):

wget https://raw.githubusercontent.com/facebookresearch/esm/main/environment.yml
mamba env create -f environment.yml
mamba activate esmfold

Testing the Installation


Create a test script test_esmfold.py:

import torch
import esm

# Load ESMFold model
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()  # Remove .cuda() if using CPU

# Test sequence (65 residues)
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

# Run prediction
with torch.no_grad():
    output = model.infer_pdb(sequence)

# Save output
with open("test_result.pdb", "w") as f:
    f.write(output)

print("Structure prediction successful!")
print(f"Output saved to test_result.pdb")
print(f"Sequence length: {len(sequence)} residues")

Run the test:

python test_esmfold.py

Success indicators:

  • Command completes without errors
  • test_result.pdb file is created
  • File contains valid PDB coordinates (see the quick check below)

Expected runtime: ~10-30 seconds on GPU for this small protein (the first run takes longer because the model weights are downloaded).
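If you want to confirm that the output actually contains coordinates, a short standalone check like the one below can help. This is a minimal sketch using only the Python standard library; it assumes the test_result.pdb file produced by the test script above.

# check_output.py - quick sanity check of the test prediction
with open("test_result.pdb") as handle:
    atom_lines = [line for line in handle if line.startswith("ATOM")]

# Count alpha-carbon atoms as a proxy for the number of modelled residues
ca_count = sum(1 for line in atom_lines if line[12:16].strip() == "CA")

print(f"ATOM records: {len(atom_lines)}")          # should be > 0
print(f"CA atoms (approx. residues): {ca_count}")  # should be 65 for the test sequence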

HPC Job Script

#!/bin/bash
#SBATCH --job-name=esmfold
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.1

# source ~/.bashrc   # Uncomment if conda/mamba is not initialised in non-interactive batch shells
mamba activate esmfold

# Predict structures for all sequences in FASTA file
esm-fold -i my_proteins.fasta -o predictions/ \
    --num-recycles 4 \
    --max-tokens-per-batch 1024
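
Save the script as, for example, esmfold.sh and submit it with sbatch esmfold.sh. Standard output and errors are written to esmfold_<jobid>.out, as set by the --output directive.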

Usage Examples

Command line interface:

esm-fold -i sequences.fasta -o output_pdbs/

Key CLI options:

Option                 | Description
-i                     | Input FASTA file
-o                     | Output directory for PDB files
--num-recycles         | Number of recycles (default: 4)
--max-tokens-per-batch | Batch shorter sequences together
--chunk-size           | Reduce memory (values: 128, 64, 32)
--cpu-only             | Run on CPU only
--cpu-offload          | Offload to CPU RAM for long sequences

Reduce memory for large proteins:

esm-fold -i large_proteins.fasta -o output/ --chunk-size 64

Process very long sequences:

esm-fold -i long_sequences.fasta -o output/ --cpu-offload

Python API:

import torch
import esm

# Load model
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Predict structure
sequence = "MVKLTAEGSEVSRQVIVQDIAYLRSLG"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

# Save
with open("prediction.pdb", "w") as f:
    f.write(pdb_string)

Get confidence scores:

import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequence = "MVKLTAEGSEVSRQVIVQDIAYLRSLG"
with torch.no_grad():
    output = model.infer(sequence)

# Per-residue confidence (pLDDT)
plddt = output["plddt"]  # Shape: (1, L)
print(f"Mean pLDDT: {plddt.mean().item():.2f}")

# Predicted TM-score
ptm = output["ptm"]
print(f"pTM: {ptm.item():.3f}")

Understanding the Output

PDB output:

  • Standard PDB format with predicted coordinates
  • B-factor column contains pLDDT confidence scores (0-100); see the parsing example after the table below
  • Higher pLDDT = higher confidence

Confidence score interpretation:

pLDDT Range | Interpretation
90-100      | Very high confidence
70-90       | Confident
50-70       | Low confidence (may be disordered)
<50         | Very low confidence (likely disordered)

Memory Usage Guide

Approximate GPU memory by sequence length:

Sequence Length | GPU Memory Needed
<200 aa         | 8-16 GB
200-400 aa      | 16-24 GB
400-600 aa      | 24-40 GB
600-1000 aa     | 40-80 GB
>1000 aa        | Use --cpu-offload
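
One practical way to apply these limits is to split the input FASTA by sequence length, so that sequences above ~1000 residues can be submitted separately with --cpu-offload. A rough sketch in plain Python; the file names are placeholders:

# split_by_length.py - separate long sequences into their own FASTA for --cpu-offload
CUTOFF = 1000  # residues

def read_fasta(path):
    """Yield (header, sequence) records from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

with open("short_sequences.fasta", "w") as short_out, \
     open("long_sequences.fasta", "w") as long_out:
    for header, seq in read_fasta("my_proteins.fasta"):
        target = short_out if len(seq) <= CUTOFF else long_out
        target.write(f"{header}\n{seq}\n")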

Troubleshooting

Warning: Common Issues

OpenFold installation fails:

  • Verify nvcc is available:

    nvcc --version
    # If not found:
    module load cuda
  • Ensure the PyTorch CUDA version matches the CUDA toolkit loaded on the cluster

“CUDA out of memory”:

# Use chunking to reduce memory
esm-fold -i input.fasta -o output/ --chunk-size 64

# Or use CPU offloading for very long sequences
esm-fold -i input.fasta -o output/ --cpu-offload

Slow on GPU (should be fast):

# Verify CUDA is detected
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True

Python version errors:

  • ESMFold requires Python ≤3.9 due to OpenFold dependencies
  • Create a new environment with Python 3.9 if needed

Model download hangs:

  • First run downloads ~2GB of weights

  • Set custom cache location:

    export TORCH_HOME=/scratch/$USER/torch_cache
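
Compute nodes often have no internet access, so it can help to trigger the download once from a login node (or any other node with outbound access) so that later jobs simply read from the cache. A minimal sketch, reusing the scratch cache location from above; note that building the model also needs several GB of CPU RAM:

# prefetch_weights.py - run once on a node with internet access to populate the cache
import os

# Point the cache at scratch storage before torch is imported (same path as the export above)
os.environ.setdefault("TORCH_HOME", os.path.expandvars("/scratch/$USER/torch_cache"))

import esm

# Constructing the pretrained model downloads the ESMFold weights into TORCH_HOME
model = esm.pretrained.esmfold_v1()
print("ESMFold weights downloaded and cached")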