```mermaid
flowchart LR
Seq[Input Sequence] --> MSA[MSA Generation]
Seq --> Templ[Template Search]
MSA --> Evo[Evoformer]
Templ --> Evo
Evo --> Struct[Structure Module]
Struct --> Coord[3D Coordinates]
subgraph "Iterative Refinement (Recycling)"
Evo
Struct
end
style Evo fill:#f9f,stroke:#333,stroke-width:2px
style Struct fill:#bbf,stroke:#333,stroke-width:2px
```
3. AlphaFold2 and OpenFold
This module dives deep into AlphaFold2—the breakthrough model that essentially “solved” the protein structure prediction problem—and OpenFold, its open-source, trainable implementation.
Live Workshop Session
📊 View slide deck
The AlphaFold2 Breakthrough
CASP14: A Watershed Moment
At the 14th Critical Assessment of protein Structure Prediction (CASP14) in 2020, AlphaFold2 achieved what many thought was impossible:
- GDT-TS scores of ~90 on targets where the previous state-of-the-art was ~60
- Near-experimental accuracy for many proteins
- Consistent performance across diverse protein families
This wasn’t incremental improvement—it was a paradigm shift.
To put this in perspective: before AlphaFold2, structure prediction was considered one of biology’s grand challenges. Some estimated it would take decades more to solve. AlphaFold2 essentially closed this chapter.
AlphaFold2 vs OpenFold
AlphaFold2 (DeepMind):
- Original implementation in JAX
- Released weights and inference code
- Not easily trainable by the community

OpenFold (Columbia/Harvard):
- Faithful PyTorch reproduction
- Fully trainable on new data
- Community-friendly and extensible
- 3-5x faster for most proteins
- Lower memory usage: can predict longer proteins on single GPUs
For this bootcamp, we’ll use ColabFold, which combines AlphaFold2’s models with fast MSA generation from MMseqs2.
Reference: Ahdritz et al. (2024) - OpenFold paper
How AlphaFold2 Works
High-Level Architecture
AlphaFold2’s architecture flows from the sequence to the final 3D structure through several specialized modules.
Inputs and Outputs
Inputs:
1. Query sequence: The protein you want to predict
2. Multiple Sequence Alignment (MSA): Related sequences found by database search
3. Templates (optional): Known structures of homologous proteins

Outputs:
1. 3D coordinates: Atomic positions for all residues
2. pLDDT scores: Per-residue confidence (0-100)
3. pTM score: Overall structure confidence
4. PAE matrix: Predicted Aligned Error between residue pairs
The MSA: Why Evolutionary Information Matters
The Multiple Sequence Alignment is arguably the most important input to AlphaFold2. It aligns your query sequence with evolutionarily related sequences to find co-evolution patterns.
```
Query:    MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELAL
Homolog1: MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELAL
Homolog2: MKVLWGALLVTFLAGCQAKIEQAVETEPEPELRQQTEWQSGQRWDLAL
Homolog3: MKILWAALLVSFLAGCQAKVEQAVEAEPEPELRQQTEWQSGQRWELAL
          ** **.****:******* :*****.**************.*:***
```
AlphaFold2’s accuracy depends heavily on MSA quality. Proteins with few homologs (orphan proteins, designed proteins) are harder to predict because there’s less evolutionary information to leverage.
How MSAs are Generated
AlphaFold2 uses multiple database search tools to build comprehensive MSAs:
JackHMMER (iterative profile HMM search):
- Searches UniRef90, MGnify, and other databases
- Iteratively builds a profile from hits and re-searches
- Highly sensitive but computationally expensive
- Used for the “genetic” MSA in AlphaFold2
HHBlits (HMM-HMM search):
- Searches clustered databases like BFD (Big Fantastic Database)
- Faster than JackHMMER with comparable sensitivity
- Used for additional MSA depth
MMseqs2 (ColabFold’s approach):
- 100-1000x faster than JackHMMER
- Searches pre-computed ColabFold databases
- Slight accuracy trade-off for massive speed gains
- Makes AlphaFold2 practical for large-scale predictions
The number of effective sequences (Neff) in an MSA correlates strongly with prediction accuracy:
| Neff | Expected Accuracy |
|---|---|
| >1000 | High confidence predictions likely |
| 100-1000 | Good predictions for most proteins |
| 30-100 | Predictions may be unreliable in some regions |
| <30 | Significant uncertainty; consider single-sequence methods |
Neff accounts for sequence redundancy—100 nearly identical sequences contribute less information than 100 diverse sequences.
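One common way to compute Neff is to down-weight each sequence by the size of its identity cluster; the 80% identity threshold below is a typical choice, but exact conventions vary between pipelines. A small sketch:

```python
import numpy as np

def compute_neff(msa, identity_threshold=0.8):
    """Effective sequence count: each sequence is weighted by 1 / (number of
    sequences, including itself, sharing >= the identity threshold with it)."""
    msa = np.asarray(msa)                       # (n_seq, length), integer-encoded
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=-1)
    cluster_sizes = (identity >= identity_threshold).sum(axis=1)
    return float((1.0 / cluster_sizes).sum())

# Toy MSA: three identical sequences plus one diverged sequence.
# The redundant trio counts as ~1 effective sequence, so Neff is 2, not 4.
msa = [[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [4, 3, 2, 1, 0]]
print(compute_neff(msa))  # 2.0
```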
Co-evolution: The Key to Structure Prediction
The fundamental insight behind using MSAs for structure prediction is co-evolution: residues that are in physical contact in the 3D structure tend to mutate together during evolution to maintain their interaction.
```mermaid
flowchart LR
subgraph "In 3D Structure"
A[Residue i<br/>Asp⁻] <--> B[Residue j<br/>Lys⁺]
end
subgraph "During Evolution"
C[Asp⁻ → Lys⁺] --> D[Lys⁺ → Asp⁻]
end
A --> C
B --> D
style A fill:#ffcccc
style B fill:#ccccff
```
Why does this happen?
Consider two residues forming a salt bridge (Asp⁻ interacting with Lys⁺):
- If position i mutates from Asp to Lys (negative to positive charge)
- The interaction is disrupted—the protein may misfold
- Unless position j compensates by mutating from Lys to Asp
- The complementary mutation restores the interaction
This creates correlated mutations that we can detect statistically across thousands of homologous sequences.
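The simplest statistic for detecting such correlated columns is mutual information between alignment positions. This toy sketch (not AlphaFold2's mechanism, which learns the signal implicitly) flags a perfectly covarying column pair:

```python
import numpy as np
from collections import Counter

def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    n = len(msa)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum((c / n) * np.log((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), c in p_ij.items())

# Toy MSA: columns 0 and 1 swap D/K together (a compensatory mutation);
# column 2 is conserved and carries no co-variation signal.
msa = ["DKA", "DKA", "KDA", "KDA"]
print(column_mutual_information(msa, 0, 1))  # ln(2) ~ 0.693: strongly coupled
print(column_mutual_information(msa, 0, 2))  # 0.0: no coupling
```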
From Co-evolution to Contacts: Direct Coupling Analysis
Before deep learning, methods like Direct Coupling Analysis (DCA) extracted contact predictions from MSAs:
```mermaid
flowchart TD
MSA[Multiple Sequence Alignment] --> MI[Mutual Information<br/>Raw correlations]
MI --> Problem[Problem: Transitive correlations<br/>A↔B and B↔C implies spurious A↔C]
Problem --> DCA[Direct Coupling Analysis<br/>Separates direct from indirect]
DCA --> Contacts[Contact Predictions]
style Problem fill:#ffcccc
style DCA fill:#ccffcc
```
The challenge: If residue A co-evolves with B, and B co-evolves with C, simple correlation analysis will show spurious A↔C co-evolution even if they never contact.
DCA’s solution: Use inverse covariance (precision) matrices or pseudolikelihood methods to separate direct couplings from indirect (transitive) correlations.
AlphaFold2 doesn’t explicitly compute DCA—instead, its attention mechanisms learn to extract these signals directly from the MSA representation.
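The precision-matrix idea is easy to demonstrate on synthetic data. Below, a Gaussian toy model (not real MSA statistics): A drives B and B drives C, so A and C are strongly correlated but never directly coupled; the inverse covariance suppresses the spurious A-C entry:

```python
import numpy as np

# Toy illustration of transitive correlations. Three "positions": A drives B,
# and B drives C, but A and C never interact directly.
rng = np.random.default_rng(0)
n = 5000
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
X = np.stack([a, b, c], axis=1)

cov = np.cov(X, rowvar=False)        # raw covariances: strong spurious A-C link
precision = np.linalg.inv(cov)       # inverse covariance captures direct couplings

print(f"cov(A,C) = {cov[0, 2]:.2f}, precision(A,C) = {precision[0, 2]:.2f}")
```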
How AlphaFold2 Processes the MSA
The MSA enters AlphaFold2 as a tensor with dimensions:
- Sequences: Number of aligned sequences (typically ~500-5000, subsampled from full MSA)
- Residues: Length of the query sequence
- Features: One-hot encoding + positional features (22 dimensions per residue)
The model processes this through two parallel representations:
```mermaid
flowchart TB
MSA[MSA Input<br/>N_seq × L × 22] --> MSArep[MSA Representation<br/>N_seq × L × 256]
MSA --> Pair[Pair Representation<br/>L × L × 128]
MSArep <--> Evo[Evoformer<br/>48 blocks]
Pair <--> Evo
Evo --> Final[To Structure Module]
style Evo fill:#f9f,stroke:#333,stroke-width:2px
```
- MSA Representation: Tracks information about each residue in each sequence
- Pair Representation: Tracks relationships between all pairs of residues in the query
The Evoformer: Learning from Evolution
The Evoformer is AlphaFold2’s core innovation—a specialized neural network architecture that processes evolutionary information to understand protein structure. It consists of 48 blocks, each updating two interconnected representations.
The Two Representations
The Evoformer maintains and iteratively refines two tensors:
1. MSA Representation (N_seq × L × 256):
- Each entry represents one residue in one sequence
- Captures per-position information across evolutionary history
- Updated by row and column attention
2. Pair Representation (L × L × 128):
- Each entry represents the relationship between two residues
- Encodes distance, orientation, and contact information
- Updated by triangle attention and outer product mean
```mermaid
flowchart TB
subgraph "Evoformer Block (×48)"
subgraph "MSA Stack"
Row[Row-wise Attention<br/>with pair bias]
Col[Column-wise Attention]
end
subgraph "Communication"
OPM[Outer Product Mean<br/>MSA → Pair]
end
subgraph "Pair Stack"
TriStart[Triangle Attention<br/>Starting Node]
TriEnd[Triangle Attention<br/>Ending Node]
TriOut[Triangle Multiplicative<br/>Outgoing]
TriIn[Triangle Multiplicative<br/>Incoming]
Trans[Transition Layer]
end
end
Row --> Col
Col --> OPM
OPM --> TriStart
TriStart --> TriEnd
TriEnd --> TriOut
TriOut --> TriIn
TriIn --> Trans
style OPM fill:#ffd700
```
Row-wise Attention: Within-Sequence Reasoning
Row attention operates along the residue dimension for each sequence in the MSA:
```
For each sequence s in the MSA:
    Query, Key, Value = Linear projections of MSA representation
    Attention weights = softmax(Q·K^T / √d + pair_bias)
    Output = Attention weights · V
```
Key insight: The attention weights are biased by the pair representation. This means the model’s understanding of residue relationships directly influences how it processes each sequence.
What this achieves:
- Residues that are structurally related attend to each other
- Long-range dependencies are captured regardless of sequence distance
- The pair bias acts like a “structural prior” during MSA processing
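The pseudocode above can be sketched as a single-head NumPy toy. Random weights stand in for the learned projections, and `pair_bias` here is a placeholder for what would be a linear projection of the pair representation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_attention_with_pair_bias(msa_repr, pair_bias, d=8):
    """Single-head sketch: every sequence attends over its own residues, and the
    (L x L) pair representation is added to the attention logits as a bias."""
    n_seq, L, c = msa_repr.shape
    rng = np.random.default_rng(0)                  # random stand-ins for learned weights
    Wq, Wk, Wv = (rng.normal(scale=c ** -0.5, size=(c, d)) for _ in range(3))
    q, k, v = msa_repr @ Wq, msa_repr @ Wk, msa_repr @ Wv
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_seq, L, L)
    logits = logits + pair_bias[None, :, :]         # same bias for every sequence
    return softmax(logits) @ v                      # (n_seq, L, d)

msa_repr = np.random.default_rng(1).normal(size=(4, 6, 16))
pair_bias = np.zeros((6, 6))                        # would come from the pair repr.
out = row_attention_with_pair_bias(msa_repr, pair_bias)
print(out.shape)  # (4, 6, 8)
```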
Column-wise Attention: Cross-Sequence Reasoning
Column attention operates across sequences at each position:
```
For each position i in the sequence:
    Compare how position i appears across all sequences
    Learn patterns of conservation and variation
```
What this achieves:
- Identifies conserved residues (important for function/structure)
- Detects co-varying positions (co-evolution signal)
- Aggregates information from thousands of evolutionary samples
- Row attention: “What other positions in this sequence are relevant to position i?”
- Column attention: “What can I learn about position i by looking at how it varies across evolution?”
The Outer Product Mean: Bridging MSA and Pairs
The Outer Product Mean is the critical operation that transfers co-evolution information from the MSA representation into the pair representation:
```
For positions i and j:
    pair_update[i,j] = Mean over sequences s of (MSA[s,i] ⊗ MSA[s,j])
```
Where ⊗ denotes the outer product.
Intuition: If positions i and j consistently appear together in certain amino acid combinations across many sequences, this outer product will capture that correlation and inject it into the pair representation.
This is how AlphaFold2 “learns” co-evolution: The outer product implicitly computes correlation statistics between positions, similar to what DCA does explicitly—but in a learnable, differentiable way.
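A minimal NumPy sketch of the operation, omitting the linear projections and normalization of the real module:

```python
import numpy as np

def outer_product_mean(msa_repr):
    """pair_update[i,j] = mean over sequences s of outer(MSA[s,i], MSA[s,j]),
    flattened to one feature vector per residue pair (projections omitted)."""
    n_seq, L, c = msa_repr.shape
    # einsum over the sequence axis: (s,i,a) x (s,j,b) -> (i,j,a,b)
    op = np.einsum("sia,sjb->ijab", msa_repr, msa_repr) / n_seq
    return op.reshape(L, L, c * c)

msa_repr = np.random.default_rng(0).normal(size=(8, 5, 4))   # (n_seq, L, c)
pair_update = outer_product_mean(msa_repr)
print(pair_update.shape)  # (5, 5, 16)
```

Because the sequence axis is averaged out, each (i, j) entry is exactly a matrix of second-order co-occurrence statistics between positions i and j, which is the correlation signal DCA-style methods compute explicitly.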
Triangle Updates: Enforcing Geometric Consistency
The triangle updates are what make the pair representation geometrically consistent. They’re based on a simple principle: distances must satisfy the triangle inequality.
```mermaid
flowchart LR
subgraph "Triangle Inequality"
A((i)) --- B((j))
B --- C((k))
A --- C
end
subgraph "Constraint"
D[d_ij + d_jk ≥ d_ik]
end
```
If we know the relationships i↔k and k↔j, we can infer something about i↔j.
Triangle Attention (Starting and Ending Node):
- “Starting node”: For edge i→j, attend over all edges i→k that share the starting node
- “Ending node”: For edge i→j, attend over all edges k→j that share the ending node
```
Triangle attention (starting):
    For edge (i,j): aggregate information from all (i,k) edges

Triangle attention (ending):
    For edge (i,j): aggregate information from all (k,j) edges
```
Triangle Multiplicative Updates (Outgoing and Incoming):
These use multiplicative gating to combine information:
```
Triangle multiplicative (outgoing):
    pair[i,j] += Σ_k gate(pair[i,k]) × pair[k,j]

Triangle multiplicative (incoming):
    pair[i,j] += Σ_k pair[i,k] × gate(pair[k,j])
```
Why both attention AND multiplicative? They capture different types of relationships:
- Attention: Soft selection of which intermediate nodes k matter
- Multiplicative: Direct combination of path information
Think of the pair representation as a fully-connected graph where each edge (i,j) stores information. Triangle updates pass messages along triangles, enforcing that if A is close to B and B is close to C, this constrains the A-C relationship.
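A toy version of the outgoing update, with a plain elementwise sigmoid standing in for the learned gating:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triangle_multiplicative_outgoing(pair):
    """Sketch: pair[i,j] += sum_k gate(pair[i,k]) * pair[k,j], per channel."""
    gated = sigmoid(pair) * pair                     # stand-in for learned gating
    update = np.einsum("ikc,kjc->ijc", gated, pair)  # sum over intermediate node k
    return pair + update

pair = np.random.default_rng(0).normal(size=(6, 6, 3))   # (L, L, channels)
updated = triangle_multiplicative_outgoing(pair)
print(updated.shape)  # (6, 6, 3)
```

Note the einsum is a batched matrix product over the intermediate index k, done independently per channel: every edge (i, j) aggregates information along all two-step paths i→k→j.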
Evoformer Summary
After 48 blocks, the Evoformer has:
- Extracted co-evolutionary signals from the MSA via column attention and outer product mean
- Propagated structural information through row attention with pair bias
- Enforced geometric consistency through triangle updates
- Built a rich pair representation encoding likely 3D relationships
The Structure Module: From Representations to 3D Coordinates
The Structure Module takes the refined representations from the Evoformer and converts them into actual 3D atomic coordinates. This is where AlphaFold2 transitions from “reasoning about structure” to “predicting structure.”
The Key Challenge: SE(3) Equivariance
Protein structure prediction has a fundamental requirement: the prediction should be independent of how we orient the input.
- If we rotate the coordinate frame, the predicted structure should rotate identically
- If we translate the origin, distances should be preserved
- The prediction should depend only on intrinsic properties, not arbitrary reference frames
This is called SE(3) equivariance (Special Euclidean group in 3D = rotations + translations).
The problem: Standard neural networks don’t naturally respect this symmetry. A naive network would give different predictions for the same protein oriented differently.
Frames: The Backbone Representation
AlphaFold2 represents each residue using a local coordinate frame (also called a “rigid body” or “frame”):
```mermaid
flowchart LR
subgraph "Residue Frame"
O[Origin at Cα] --> X[x-axis: N→Cα direction]
O --> Y[y-axis: perpendicular in plane]
O --> Z[z-axis: completes right-hand system]
end
```
Each frame consists of:
- Translation (t): The 3D position of the Cα atom
- Rotation (R): A 3×3 rotation matrix defining the local orientation
The backbone is thus represented as a sequence of frames: T₁, T₂, …, T_L
Why frames? They naturally encode:
- Position (where is this residue?)
- Orientation (how is the peptide bond oriented?)
- Relative geometry (how do two residues relate in space?)
Invariant Point Attention (IPA): The Core Innovation
Invariant Point Attention is the structure module’s key mechanism. It’s a modified attention operation that:
- Is invariant to global rotations and translations
- Can reason about 3D geometric relationships
- Uses both the pair representation and current 3D coordinates
```mermaid
flowchart TB
subgraph "IPA Inputs"
Single[Single Representation<br/>per-residue features]
Pair[Pair Representation<br/>from Evoformer]
Frames[Current Frames<br/>T_i for each residue]
end
subgraph "IPA Computation"
QKV[Generate Q, K, V<br/>in local frames]
Points[Generate query/key points<br/>in local coordinates]
Transform[Transform points to<br/>global frame]
Dist[Compute point-to-point<br/>distances]
Attn[Attention with<br/>pair bias + distance bias]
end
Single --> QKV
Frames --> Points
Points --> Transform
Transform --> Dist
Pair --> Attn
Dist --> Attn
QKV --> Attn
Attn --> Output[Updated Single<br/>Representation]
```
The “invariant points” trick:
- Each residue generates “query points” and “key points” in its local coordinate frame
- These points are transformed to the global frame using the current frame estimate
- Distances between points are computed (distances are invariant to rotation/translation!)
- These distances bias the attention weights
Why this works: By working with distances between transformed points, IPA can reason about 3D geometry while remaining invariant to the choice of global reference frame.
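The invariance claim is easy to verify numerically: transform every frame by the same global rotation and translation, and the point-to-point distance bias is unchanged. A sketch, simplified to one point per residue:

```python
import numpy as np

def distance_bias(frames_R, frames_t, local_points):
    """Map each residue's local point into the global frame, then compute all
    pairwise distances; these are invariant to any global rotation/translation."""
    # global_points[i] = R_i @ local_points[i] + t_i
    global_pts = np.einsum("iab,ib->ia", frames_R, local_points) + frames_t
    diff = global_pts[:, None, :] - global_pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)             # (L, L) distance matrix

rng = np.random.default_rng(0)
L = 5
frames_R = np.stack([np.eye(3)] * L)                 # identity rotations for simplicity
frames_t = rng.normal(size=(L, 3))
points = rng.normal(size=(L, 3))
d1 = distance_bias(frames_R, frames_t, points)

# Apply the same global rotation (about z) plus a translation to every frame:
theta = 0.7
Rg = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
d2 = distance_bias(np.einsum("ab,ibc->iac", Rg, frames_R),   # R_i -> Rg @ R_i
                   frames_t @ Rg.T + 1.5,                    # t_i -> Rg @ t_i + tg
                   points)
print(np.allclose(d1, d2))  # True
```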
The Frame Update: Predicting Backbone Geometry
After each IPA layer, the structure module updates the residue frames:
```
For each residue i:
    quaternion, translation = MLP(single_representation[i])
    T_i = T_i ∘ (quaternion_to_rotation(quaternion), translation)
```
Key insight: Updates are applied as compositions with the current frame, not absolute predictions. This makes learning easier—the network only needs to predict small refinements.
The frames start from an initial guess (often just the identity—all residues at the origin with no rotation) and are iteratively refined.
Side Chain Prediction: Torsion Angles
The structure module also predicts side chain conformations using torsion (dihedral) angles:
backbone torsion angles: φ, ψ, ω
side chain torsion angles: χ₁, χ₂, χ₃, χ₄ (depending on amino acid)
Each angle is predicted as:
```
(sin(angle), cos(angle)) = MLP(single_representation)
angle = atan2(sin, cos)
```
Why sin/cos instead of the angle directly? Angles wrap around (0° = 360°), which creates discontinuities. Predicting (sin, cos) avoids this problem.
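A small demonstration of why the (sin, cos) encoding is preferred: 359° and 1° describe nearly the same torsion, and their encodings are close even though the raw angle values differ by almost 360°.

```python
import numpy as np

def angle_from_sincos(raw):
    """Normalize an unnormalized (sin, cos) pair and recover the angle."""
    s, c = raw / np.linalg.norm(raw)
    return np.arctan2(s, c)

enc_359 = np.array([np.sin(np.deg2rad(359.0)), np.cos(np.deg2rad(359.0))])
enc_1 = np.array([np.sin(np.deg2rad(1.0)), np.cos(np.deg2rad(1.0))])

print(np.linalg.norm(enc_359 - enc_1))         # ~0.035: a tiny gap in encoding space
print(np.rad2deg(angle_from_sincos(enc_359)))  # ~ -1.0, i.e. equivalent to 359 deg
```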
From Frames to Atoms
Once we have frames and torsion angles, computing atom positions is deterministic:
For each residue:
1. Place backbone atoms (N, Cα, C, O) using the frame
2. Place Cβ using ideal bond geometry
3. Rotate each side chain bond by predicted χ angles
4. All atom positions fall out of the geometry
This uses idealized bond lengths and angles from chemistry—the network only predicts the “free” degrees of freedom (frame rotations/translations and torsion angles).
Structure Module Architecture Summary
```mermaid
flowchart TB
EvoOut[Evoformer Output<br/>Single + Pair repr.] --> Init[Initialize Frames<br/>Identity transforms]
subgraph "Structure Module (8 layers)"
Init --> IPA1[IPA Layer 1]
IPA1 --> Update1[Frame Update]
Update1 --> IPA2[IPA Layer 2]
IPA2 --> Update2[Frame Update]
Update2 --> Dots[...]
Dots --> IPA8[IPA Layer 8]
IPA8 --> Update8[Frame Update]
end
Update8 --> Torsion[Predict Torsion Angles]
Update8 --> Final[Final Frames]
Final --> Atoms[Compute Atom<br/>Positions]
Torsion --> Atoms
style IPA1 fill:#bbf
style IPA8 fill:#bbf
```
Recycling: Iterative Refinement
AlphaFold2 doesn’t just run once—it recycles its predictions back as input for multiple rounds of refinement.
```mermaid
flowchart LR
subgraph "Recycle 1"
E1[Evoformer] --> S1[Structure Module]
end
subgraph "Recycle 2"
E2[Evoformer] --> S2[Structure Module]
end
subgraph "Recycle 3"
E3[Evoformer] --> S3[Structure Module]
end
S1 -->|"Pair repr + structure"| E2
S2 -->|"Pair repr + structure"| E3
S3 --> Final[Final Prediction]
```
What gets recycled?
- Pair representation from previous iteration
- Predicted structure (frames) from structure module
- First row of MSA representation (query sequence features)
Why recycle?
- The Evoformer can use structural information to better interpret the MSA
- Iterative refinement allows progressive improvement
- Similar to how Rosetta uses “iterative assembly” in fragment-based prediction
Typical recycling: 3 rounds (default), up to 20 for difficult targets
Each recycle can be thought of as a step of optimization: the structure progressively “folds” from its trivial initial state (all frames at the origin) toward the final prediction. This is loosely analogous to molecular dynamics or energy minimization.
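The control flow can be sketched as a loop; `evoformer`, `structure_module`, and the convergence test below are toy stand-ins, not the real OpenFold API:

```python
import numpy as np

# Toy stand-ins: "evoformer" mixes the fresh pair estimate with the recycled
# one, and "structure_module" collapses it to fake per-residue coordinates.
def evoformer(msa, prev_pair):
    pair = msa.T @ msa / len(msa)
    return pair if prev_pair is None else 0.5 * (pair + prev_pair)

def structure_module(pair):
    return pair.sum(axis=1)

def predict_with_recycling(msa, num_recycles=3, tol=1e-3):
    """Initial pass plus up to num_recycles refinement passes, feeding the
    previous pair representation and structure back in; stop once converged."""
    prev_pair, prev_struct = None, None
    for cycle in range(num_recycles + 1):
        pair = evoformer(msa, prev_pair)
        struct = structure_module(pair)
        if prev_struct is not None and np.linalg.norm(struct - prev_struct) < tol:
            break                                  # early stop: structure converged
        prev_pair, prev_struct = pair, struct
    return struct, cycle

msa = np.random.default_rng(0).normal(size=(8, 5))
struct, n_cycles = predict_with_recycling(msa, num_recycles=5)
print(n_cycles)  # 1: the toy model converges after one recycle
```

The early-stop check mirrors the `recycle_early_stop_tolerance` idea: once successive structures agree within a threshold, further recycles are skipped.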
Key Parameters and Settings
Recycling Parameters
| Parameter | What it does | Typical values |
|---|---|---|
| `num_recycles` | How many refinement passes | 3-20 (more = better but slower) |
| `max_recycles` | Hard cap for recycling | 20 |
| `recycle_early_stop_tolerance` | Stop if structure converges | 0.5 Å |
When to increase recycles:
- Large proteins (>500 residues)
- Multi-domain proteins
- When initial predictions look uncertain
MSA Settings
| Parameter | What it does | Options |
|---|---|---|
| `msa_mode` | How to generate the MSA | mmseqs2 (fast), jackhmmer (thorough) |
| `pair_mode` | For multimers: how to pair sequences | paired, unpaired, paired+unpaired |
| `use_msa` | Whether to use an MSA at all | true/false |
| `use_templates` | Whether to search for structural templates | true/false |
You can run AlphaFold2 without an MSA (`use_msa=false`), but accuracy drops significantly. Use this only for:
- Designed proteins with no natural homologs
- Quick preliminary scans
- When comparing to ESMFold
Model Selection
| Parameter | What it does | Options |
|---|---|---|
| `model_type` | Monomer vs multimer | monomer, monomer_ptm, multimer |
| `num_models` | Number of model versions to run | 1-5 |
| `rank_by` | How to select the best prediction | plddt, ptm, iptm+ptm |
The 5 models: AlphaFold2 was trained with 5 different random initializations. Running all 5 provides:
- Ensemble diversity (different predictions)
- Uncertainty estimation (do they agree?)
- A better chance of finding the best structure
Relaxation
| Parameter | What it does |
|---|---|
| `use_amber` | Run Amber energy minimization |
| `use_gpu_relax` | Use the GPU for relaxation (faster) |
Why relax?
- Fixes minor clashes and bad geometry
- Makes structures more physically realistic
- Important for downstream applications (docking, MD)
AlphaFold2 Extensions
AlphaFold-Multimer
Predicting protein complexes (multiple chains interacting):
Key modifications:
- Cross-chain MSA pairing for evolutionary signal
- Losses that account for chain permutation symmetry
- Interface-aware training
```
# In your FASTA file, separate chains with ":"
>complex
SEQUENCEOFCHAINA:SEQUENCEOFCHAINB
```

AlphaFold Database
DeepMind released predictions for 200+ million proteins—essentially all of UniProt.
- Access: alphafold.ebi.ac.uk
- Coverage: Human proteome + 47 other key organisms
- Use case: Check if your protein has already been predicted!
AF-Cluster
Idea: Proteins can have multiple conformations. Can we bias AlphaFold2 toward different states?
Approach:
- Cluster the MSA into subgroups
- Run predictions with different MSA subsets
- Different clusters may yield different conformational states
Reference: Wayment-Steele et al. (2022)
AFsample
Idea: Generate diverse predictions through sampling.
Approach:
- Enable dropout during inference
- Increase recycling
- Generate multiple diverse predictions
This is useful for understanding conformational flexibility and uncertainty.
Automated Workflows
Tools like EvoPro and BindCraft use AlphaFold2 as part of larger design pipelines:
- Generate candidate designs
- Predict structures with AF2
- Score and filter based on confidence
- Iterate to improve designs
Understanding the Output
pLDDT: Per-Residue Confidence
pLDDT (predicted Local Distance Difference Test) ranges from 0-100:
| pLDDT | Interpretation | What to do |
|---|---|---|
| >90 | Very high confidence | Trust the local structure |
| 70-90 | Confident | Generally reliable |
| 50-70 | Low confidence | Treat with caution |
| <50 | Very low confidence | Likely disordered or wrong |
Low pLDDT regions often correspond to:
- Intrinsically disordered regions (genuinely unstructured)
- Flexible loops (multiple conformations possible)
- Crystal contacts (structure depends on environment)
Low confidence is information, not failure!
pTM and ipTM: Global Confidence
- pTM (predicted TM-score): Overall structure confidence (0-1)
- ipTM (interface pTM): Confidence in interface prediction (for multimers)
| Score | Interpretation |
|---|---|
| >0.8 | High confidence |
| 0.5-0.8 | Moderate confidence |
| <0.5 | Low confidence |
PAE: Predicted Aligned Error
The PAE matrix shows confidence in relative positions between residue pairs.
Reading PAE plots:
- Blue/low values: confident in the relative position
- Red/high values: uncertain about the relative position
- Diagonal blocks: domains (confident within, uncertain between)
PAE is crucial for:
- Identifying domain boundaries
- Assessing multimer interface confidence
- Understanding which parts of the structure are well-determined relative to each other
Practical Considerations
When to Use AlphaFold2
Best for:
- Proteins with many homologs in databases
- Monomers and stable complexes
- When accuracy is paramount
Less suitable for:
- Designed/synthetic proteins (few homologs)
- Highly dynamic proteins
- When speed is critical (use ESMFold instead)
Resource Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| GPU RAM | 16 GB | 40+ GB (A100) |
| CPU RAM | 32 GB | 64 GB |
| Disk | 15 GB | 100+ GB (with databases) |
Tips for Better Predictions
- Check the AlphaFold Database first—your protein may already be predicted
- Use all 5 models for important predictions
- Examine pLDDT and PAE—don’t just look at the structure
- Consider multiple conformations for flexible proteins
- Validate experimentally for high-stakes applications
Hands-On Exercise
Part 1: Run ColabFold Prediction
Goal: Generate your first AlphaFold2 prediction.
We’ll predict the structure of Green Fluorescent Protein (GFP) using ColabFold.
1. Prepare your sequence:
Create a file called gfp.fasta:
```
>GFP
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
```
You can also download this file: 1GFL.fasta
2. Run prediction:
If using LocalColabFold on your HPC:
```bash
colabfold_batch gfp.fasta gfp_output/
```

If using the ColabFold notebook:
- Go to ColabFold
- Paste your sequence
- Run all cells
3. Expected output:
- `gfp_relaxed_rank_001_*.pdb` - the best predicted structure
- `gfp_scores_rank_001_*.json` - confidence scores
- `gfp_coverage.png` - MSA coverage
- `gfp_pae.png` - PAE heatmap
Expected runtime: 5-15 minutes depending on your hardware.
Part 2: Analyze Your Prediction
1. Load the structure in PyMOL:
```
load gfp_output/gfp_relaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb, af2_gfp
```

2. Color by confidence (pLDDT):

```
spectrum b, blue_white_red, minimum=50, maximum=100
```

3. Questions to answer:
- What is the overall pLDDT? (Check the JSON file or PyMOL's B-factor range.)
- Which regions have high confidence? Low confidence?
- Does GFP have any disordered regions?
4. Compare to experimental structure:
```
fetch 1GFL
align af2_gfp, 1GFL
```

- What is the RMSD between prediction and experiment?
- Do the structures overlay well?
Part 3: Explore the PAE
1. Open gfp_pae.png
2. Interpret the plot:
- Is there one block (single domain) or multiple blocks?
- Are there any high-PAE (red) regions?
- What does this tell you about the structure?
Part 4: Experiment with Parameters
Try running predictions with different settings:
Experiment 1: Fewer recycles
```bash
colabfold_batch --num-recycle 1 gfp.fasta gfp_1recycle/
```

Compare to the default (3 recycles). Is there a difference in quality?
Experiment 2: Single sequence (no MSA)
```bash
colabfold_batch --msa-mode single_sequence gfp.fasta gfp_single_seq/
```

This mimics what ESMFold does. How does quality compare?
Experiment 3: All 5 models
```bash
colabfold_batch --num-models 5 gfp.fasta gfp_all_models/
```

Do the different models agree? What's the variance in pLDDT?
Part 5: Multimer Prediction (Optional)
If you have time, try predicting a protein complex:
1. Create a multimer FASTA:
```
>homodimer
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK:MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
```
(Note: The : separates chains)
2. Run prediction:
```bash
colabfold_batch homodimer.fasta homodimer_output/
```

3. Analyze:
- What is the ipTM score?
- Does the interface look reasonable?
- Check the PAE for interface confidence (off-diagonal blocks)
Questions to Consider
- How does prediction time scale with sequence length?
- Why might some regions have low pLDDT even in a well-folded protein?
- When would you trust a prediction enough to use it for experimental planning?
- What would you do if AlphaFold2 and ESMFold give different predictions?
Record Your Results
Fill in this table with your observations:
| Metric | Your GFP Prediction |
|---|---|
| Average pLDDT | |
| pTM score | |
| RMSD to 1GFL | |
| Prediction time | |
| MSA depth | |
| Regions with pLDDT < 70 | |