3. AlphaFold2 and OpenFold

This module dives deep into AlphaFold2—the breakthrough model that essentially “solved” the protein structure prediction problem—and OpenFold, its open-source, trainable implementation.

Live Workshop Session

🎥 Live workshop recording — AlphaFold2 Structure Prediction
📊 View slide deck

The AlphaFold2 Breakthrough

CASP14: A Watershed Moment

At the 14th Critical Assessment of protein Structure Prediction (CASP14) in 2020, AlphaFold2 achieved what many thought was impossible:

  • GDT-TS scores of ~90 on targets where the previous state-of-the-art was ~60
  • Near-experimental accuracy for many proteins
  • Consistent performance across diverse protein families

This wasn’t incremental improvement—it was a paradigm shift.

Note: The Scale of the Achievement

To put this in perspective: before AlphaFold2, structure prediction was considered one of biology’s grand challenges. Some estimated it would take decades more to solve. AlphaFold2 essentially closed this chapter.

AlphaFold2 vs OpenFold

AlphaFold2 (DeepMind):

  • Original implementation in JAX
  • Released weights and inference code
  • Not easily trainable by the community

OpenFold (Columbia/Harvard):

  • Faithful PyTorch reproduction
  • Fully trainable on new data
  • Community-friendly and extensible
  • 3-5x faster for most proteins
  • Lower memory usage: can predict longer proteins on a single GPU

For this bootcamp, we’ll use ColabFold, which combines AlphaFold2’s models with fast MSA generation from MMseqs2.

Reference: Ahdritz et al. (2024) - OpenFold paper


How AlphaFold2 Works

High-Level Architecture

AlphaFold2’s architecture flows from the sequence to the final 3D structure through several specialized modules.

flowchart LR
    Seq[Input Sequence] --> MSA[MSA Generation]
    Seq --> Templ[Template Search]
    
    MSA --> Evo[Evoformer]
    Templ --> Evo
    
    Evo --> Struct[Structure Module]
    Struct --> Coord[3D Coordinates]
    
    subgraph "Iterative Refinement (Recycling)"
    Evo
    Struct
    end
    
    style Evo fill:#f9f,stroke:#333,stroke-width:2px
    style Struct fill:#bbf,stroke:#333,stroke-width:2px

Inputs and Outputs

Inputs:

  1. Query sequence: The protein you want to predict
  2. Multiple Sequence Alignment (MSA): Related sequences found by database search
  3. Templates (optional): Known structures of homologous proteins

Outputs:

  1. 3D coordinates: Atomic positions for all residues
  2. pLDDT scores: Per-residue confidence (0-100)
  3. pTM score: Overall structure confidence
  4. PAE matrix: Predicted Aligned Error between residue pairs

The MSA: Why Evolutionary Information Matters

The Multiple Sequence Alignment is arguably the most important input to AlphaFold2. It aligns your query sequence with evolutionarily related sequences to find co-evolution patterns.

Query:     MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELAL
Homolog1:  MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELAL
Homolog2:  MKVLWGALLVTFLAGCQAKIEQAVETEPEPELRQQTEWQSGQRWDLAL
Homolog3:  MKILWAALLVSFLAGCQAKVEQAVEAEPEPELRQQTEWQSGQRWELAL
           ** **.****:******* :*****.**************.*:***
Important: The MSA is Critical

AlphaFold2’s accuracy depends heavily on MSA quality. Proteins with few homologs (orphan proteins, designed proteins) are harder to predict because there’s less evolutionary information to leverage.

How MSAs are Generated

AlphaFold2 uses multiple database search tools to build comprehensive MSAs:

JackHMMER (iterative profile HMM search):

  • Searches UniRef90, MGnify, and other databases
  • Iteratively builds a profile from hits and re-searches
  • Highly sensitive but computationally expensive
  • Used for the “genetic” MSA in AlphaFold2

HHBlits (HMM-HMM search):

  • Searches clustered databases like BFD (Big Fantastic Database)
  • Faster than JackHMMER with comparable sensitivity
  • Used for additional MSA depth

MMseqs2 (ColabFold’s approach):

  • 100-1000x faster than JackHMMER
  • Searches pre-computed ColabFold databases
  • Slight accuracy trade-off for massive speed gains
  • Makes AlphaFold2 practical for large-scale predictions

Note: MSA Depth Matters

The number of effective sequences (Neff) in an MSA correlates strongly with prediction accuracy:

| Neff | Expected accuracy |
|------|-------------------|
| >1000 | High-confidence predictions likely |
| 100-1000 | Good predictions for most proteins |
| 30-100 | Predictions may be unreliable in some regions |
| <30 | Significant uncertainty; consider single-sequence methods |

Neff accounts for sequence redundancy—100 nearly identical sequences contribute less information than 100 diverse sequences.
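
To make Neff concrete, here is a minimal Python sketch of one common weighting scheme: each sequence is down-weighted by the number of alignment members within 80% identity. The MSA format (a list of equal-length strings) and the exact threshold are illustrative assumptions, not AlphaFold2's literal pipeline.

import numpy as np

def neff(msa, identity_threshold=0.8):
    """Effective sequence count of an MSA given as equal-length aligned strings."""
    seqs = np.array([list(s) for s in msa])
    n = len(seqs)
    # Pairwise fractional identity over aligned columns (includes self-identity)
    identity = np.array([[np.mean(seqs[i] == seqs[j]) for j in range(n)]
                         for i in range(n)])
    # Each sequence is weighted by 1 / (number of neighbors above the threshold)
    weights = 1.0 / (identity >= identity_threshold).sum(axis=1)
    return weights.sum()

msa = [
    "MKVLWAALLVTFLAGCQAKVEQ",
    "MKVLWAALLVTFLAGCQAKVEQ",   # exact duplicate: adds almost no information
    "MRILWGSLLVSFLAGCEAKIDQ",   # divergent homolog
]
print(f"Neff = {neff(msa):.2f}")    # 2.00: the duplicate pair collapses to one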

Co-evolution: The Key to Structure Prediction

The fundamental insight behind using MSAs for structure prediction is co-evolution: residues that are in physical contact in the 3D structure tend to mutate together during evolution to maintain their interaction.

flowchart LR
    subgraph "In 3D Structure"
        A[Residue i<br/>Asp⁻] <--> B[Residue j<br/>Lys⁺]
    end

    subgraph "During Evolution"
        C[Asp⁻ → Lys⁺] --> D[Lys⁺ → Asp⁻]
    end

    A --> C
    B --> D

    style A fill:#ffcccc
    style B fill:#ccccff

Why does this happen?

Consider two residues forming a salt bridge (Asp⁻ interacting with Lys⁺):

  1. If position i mutates from Asp to Lys (negative to positive charge)
  2. The interaction is disrupted—the protein may misfold
  3. Unless position j compensates by mutating from Lys to Asp
  4. The complementary mutation restores the interaction

This creates correlated mutations that we can detect statistically across thousands of homologous sequences.
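
As a toy illustration of detecting these correlated mutations, the sketch below computes mutual information between two MSA columns. It is a bare-bones statistic on a made-up alignment, not AlphaFold2's actual procedure (and raw MI suffers from the transitivity problem discussed next).

import numpy as np
from collections import Counter

def column_mi(msa, i, j):
    """Mutual information (bits) between columns i and j of an aligned MSA."""
    col_i = [s[i] for s in msa]
    col_j = [s[j] for s in msa]
    n = len(msa)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum((c / n) * np.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), c in p_ij.items())

# Columns 0 and 2 co-vary (D pairs with K and vice versa), mimicking the
# compensating salt-bridge mutations described above; column 1 is independent.
msa = ["DAK", "DAK", "KAD", "KAD", "DGK", "KGD"]
print(f"MI(0,2) = {column_mi(msa, 0, 2):.2f} bits")  # 1.00: perfect co-variation
print(f"MI(0,1) = {column_mi(msa, 0, 1):.2f} bits")  # 0.00: independent columns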

From Co-evolution to Contacts: Direct Coupling Analysis

Before deep learning, methods like Direct Coupling Analysis (DCA) extracted contact predictions from MSAs:

flowchart TD
    MSA[Multiple Sequence Alignment] --> MI[Mutual Information<br/>Raw correlations]
    MI --> Problem[Problem: Transitive correlations<br/>A↔B and B↔C implies spurious A↔C]
    Problem --> DCA[Direct Coupling Analysis<br/>Separates direct from indirect]
    DCA --> Contacts[Contact Predictions]

    style Problem fill:#ffcccc
    style DCA fill:#ccffcc

The challenge: If residue A co-evolves with B, and B co-evolves with C, simple correlation analysis will show spurious A↔C co-evolution even if they never contact.

DCA’s solution: Use inverse covariance (precision) matrices or pseudolikelihood methods to separate direct couplings from indirect (transitive) correlations.

AlphaFold2 doesn’t explicitly compute DCA—instead, its attention mechanisms learn to extract these signals directly from the MSA representation.
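
For intuition on how DCA separates direct from transitive couplings, here is a minimal sketch using the inverse covariance (precision) matrix on simulated binary columns. Real DCA implementations work on 21-state encodings with careful regularization (e.g., pseudolikelihood maximization), so treat this strictly as a cartoon.

import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulate a chain A -> B -> C: A couples to B, B couples to C,
# so A and C end up correlated only transitively.
a = rng.integers(0, 2, n)
b = a ^ (rng.random(n) < 0.1).astype(int)   # B mostly copies A
c = b ^ (rng.random(n) < 0.1).astype(int)   # C mostly copies B

X = np.stack([a, b, c], axis=1).astype(float)
cov = np.cov(X.T) + 1e-3 * np.eye(3)        # small ridge for numerical stability
precision = np.linalg.inv(cov)

print(f"Correlation A-C:     {np.corrcoef(a, c)[0, 1]:+.2f}")  # large but spurious
print(f"Direct coupling A-C: {precision[0, 2]:+.2f}")          # near zero
print(f"Direct coupling A-B: {precision[0, 1]:+.2f}")          # strongly nonzero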

How AlphaFold2 Processes the MSA

The MSA enters AlphaFold2 as a tensor with dimensions:

  • Sequences: Number of aligned sequences (typically ~500-5000, subsampled from the full MSA)
  • Residues: Length of the query sequence
  • Features: One-hot encoding + positional features (22 dimensions per residue)

The model processes this through two parallel representations:

flowchart TB
    MSA[MSA Input<br/>N_seq × L × 22] --> MSArep[MSA Representation<br/>N_seq × L × 256]
    MSA --> Pair[Pair Representation<br/>L × L × 128]

    MSArep <--> Evo[Evoformer<br/>48 blocks]
    Pair <--> Evo

    Evo --> Final[To Structure Module]

    style Evo fill:#f9f,stroke:#333,stroke-width:2px

  • MSA Representation: Tracks information about each residue in each sequence
  • Pair Representation: Tracks relationships between all pairs of residues in the query

Check Your Understanding

Why is the MSA so critical for AlphaFold2's accuracy?

  • It provides template structures for the model to copy.
  • It reveals co-evolutionary patterns that imply 3D contacts.
  • It just increases the size of the training data.

The Evoformer: Learning from Evolution

The Evoformer is AlphaFold2’s core innovation—a specialized neural network architecture that processes evolutionary information to understand protein structure. It consists of 48 blocks, each updating two interconnected representations.

The Two Representations

The Evoformer maintains and iteratively refines two tensors:

1. MSA Representation (N_seq × L × 256):

  • Each entry represents one residue in one sequence
  • Captures per-position information across evolutionary history
  • Updated by row and column attention

2. Pair Representation (L × L × 128):

  • Each entry represents the relationship between two residues
  • Encodes distance, orientation, and contact information
  • Updated by triangle attention and outer product mean

flowchart TB
    subgraph "Evoformer Block (×48)"
        subgraph "MSA Stack"
            Row[Row-wise Attention<br/>with pair bias]
            Col[Column-wise Attention]
        end

        subgraph "Communication"
            OPM[Outer Product Mean<br/>MSA → Pair]
        end

        subgraph "Pair Stack"
            TriStart[Triangle Attention<br/>Starting Node]
            TriEnd[Triangle Attention<br/>Ending Node]
            TriOut[Triangle Multiplicative<br/>Outgoing]
            TriIn[Triangle Multiplicative<br/>Incoming]
            Trans[Transition Layer]
        end
    end

    Row --> Col
    Col --> OPM
    OPM --> TriStart
    TriStart --> TriEnd
    TriEnd --> TriOut
    TriOut --> TriIn
    TriIn --> Trans

    style OPM fill:#ffd700

Row-wise Attention: Within-Sequence Reasoning

Row attention operates along the residue dimension for each sequence in the MSA:

For each sequence s in the MSA:
    Query, Key, Value = Linear projections of MSA representation
    Attention weights = softmax(Q·K^T / √d + pair_bias)
    Output = Attention weights · V

Key insight: The attention weights are biased by the pair representation. This means the model’s understanding of residue relationships directly influences how it processes each sequence.

What this achieves:

  • Residues that are structurally related attend to each other
  • Long-range dependencies are captured regardless of sequence distance
  • The pair bias acts like a “structural prior” during MSA processing
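
The sketch below makes the pair-bias idea concrete in plain NumPy: single-head attention over one MSA row whose logits receive an additive bias from the pair representation. The shapes and the single-head, ungated form are simplifications for illustration; the real model uses multi-head, gated attention.

import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 8                            # sequence length, head dimension

msa_row = rng.normal(size=(L, d))      # representation of one MSA sequence
pair_bias = rng.normal(size=(L, L))    # a linear projection of the pair repr.

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = msa_row @ Wq, msa_row @ Wk, msa_row @ Wv

logits = Q @ K.T / np.sqrt(d) + pair_bias       # pair repr. steers attention
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ V                               # (L, d) updated row

print(out.shape)  # (6, 8)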

Column-wise Attention: Cross-Sequence Reasoning

Column attention operates across sequences at each position:

For each position i in the sequence:
    Compare how position i appears across all sequences
    Learn patterns of conservation and variation

What this achieves:

  • Identifies conserved residues (important for function/structure)
  • Detects co-varying positions (co-evolution signal)
  • Aggregates information from thousands of evolutionary samples

Note: Row vs Column Attention Intuition

  • Row attention: “What other positions in this sequence are relevant to position i?”
  • Column attention: “What can I learn about position i by looking at how it varies across evolution?”

The Outer Product Mean: Bridging MSA and Pairs

The Outer Product Mean is the critical operation that transfers co-evolution information from the MSA representation into the pair representation:

For positions i and j:
    pair_update[i,j] = Mean over sequences s of (MSA[s,i] ⊗ MSA[s,j])

Where ⊗ denotes the outer product.

Intuition: If positions i and j consistently appear together in certain amino acid combinations across many sequences, this outer product will capture that correlation and inject it into the pair representation.

This is how AlphaFold2 “learns” co-evolution: The outer product implicitly computes correlation statistics between positions, similar to what DCA does explicitly—but in a learnable, differentiable way.
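
Here is a minimal NumPy sketch of the outer product mean. The channel sizes and the final projection are illustrative assumptions; in the real model, the MSA activations are first projected down to a small channel dimension, as mimicked here.

import numpy as np

rng = np.random.default_rng(0)
N_seq, L, c = 16, 10, 4                 # sequences, length, reduced channels

msa = rng.normal(size=(N_seq, L, c))    # (projected) MSA representation

# Mean over sequences of the outer product between columns i and j:
# opm[i, j] is a c*c summary of how positions i and j co-vary.
opm = np.einsum("sic,sjd->ijcd", msa, msa) / N_seq
opm = opm.reshape(L, L, c * c)

# Linear projection into the pair representation's channel size
c_pair = 8
W = rng.normal(size=(c * c, c_pair))
pair_update = opm @ W                   # (L, L, c_pair), added to the pair repr.

print(pair_update.shape)  # (10, 10, 8)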

Triangle Updates: Enforcing Geometric Consistency

The triangle updates are what make the pair representation geometrically consistent. They’re based on a simple principle: distances must satisfy the triangle inequality.

flowchart LR
    subgraph "Triangle Inequality"
        A((i)) --- B((j))
        B --- C((k))
        A --- C
    end

    subgraph "Constraint"
        D[d_ij + d_jk ≥ d_ik]
    end

If we know the relationships i↔k and k↔j, we can infer something about i↔j.

Triangle Attention (Starting and Ending Node):

  • “Starting node”: For edge i→j, attend over all edges i→k that share the starting node
  • “Ending node”: For edge i→j, attend over all edges k→j that share the ending node

Triangle attention (starting):
    For edge (i,j): aggregate information from all (i,k) edges

Triangle attention (ending):
    For edge (i,j): aggregate information from all (k,j) edges

Triangle Multiplicative Updates (Outgoing and Incoming):

These use multiplicative gating to combine information:

Triangle multiplicative (outgoing):
    pair[i,j] += Σ_k gate(pair[i,k]) × pair[k,j]

Triangle multiplicative (incoming):
    pair[i,j] += Σ_k pair[i,k] × gate(pair[k,j])

Why both attention AND multiplicative? They capture different types of relationships:

  • Attention: Soft selection of which intermediate nodes k matter
  • Multiplicative: Direct combination of path information

Tip: Triangle Updates Are Like Message Passing on a Graph

Think of the pair representation as a fully-connected graph where each edge (i,j) stores information. Triangle updates pass messages along triangles, enforcing that if A is close to B and B is close to C, this constrains the A-C relationship.
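
A minimal NumPy sketch of the “outgoing” triangle multiplicative update, following the simplified pseudocode above (the published version gates both legs and pairs edges i→k with j→k, among other details omitted here).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
L, c = 8, 4
pair = rng.normal(size=(L, L, c))

Wg = rng.normal(size=(c, c))
gated = sigmoid(pair @ Wg) * pair                 # gate(pair[i, k])

# Edge (i, j) aggregates path information through every intermediate node k
update = np.einsum("ikc,kjc->ijc", gated, pair)   # sum_k gate(pair[i,k]) * pair[k,j]
pair = pair + update                              # residual update

print(pair.shape)  # (8, 8, 4)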

Evoformer Summary

After 48 blocks, the Evoformer has:

  1. Extracted co-evolutionary signals from the MSA via column attention and outer product mean
  2. Propagated structural information through row attention with pair bias
  3. Enforced geometric consistency through triangle updates
  4. Built a rich pair representation encoding likely 3D relationships

Check Your Understanding

Which Evoformer operation transfers co-evolution signals from the MSA to the pair representation?

  • Row-wise attention
  • Triangle multiplicative update
  • Outer product mean
  • Column-wise attention

The Structure Module: From Representations to 3D Coordinates

The Structure Module takes the refined representations from the Evoformer and converts them into actual 3D atomic coordinates. This is where AlphaFold2 transitions from “reasoning about structure” to “predicting structure.”

The Key Challenge: SE(3) Equivariance

Protein structure prediction has a fundamental requirement: the prediction should be independent of how we orient the input.

  • If we rotate the coordinate frame, the predicted structure should rotate identically
  • If we translate the origin, distances should be preserved
  • The prediction should depend only on intrinsic properties, not arbitrary reference frames

This is called SE(3) equivariance (Special Euclidean group in 3D = rotations + translations).

The problem: Standard neural networks don’t naturally respect this symmetry. A naive network would give different predictions for the same protein oriented differently.

Frames: The Backbone Representation

AlphaFold2 represents each residue using a local coordinate frame (also called a “rigid body” or “frame”):

flowchart LR
    subgraph "Residue Frame"
        O[Origin at Cα] --> X[x-axis: N→Cα direction]
        O --> Y[y-axis: perpendicular in plane]
        O --> Z[z-axis: completes right-hand system]
    end

Each frame consists of:

  • Translation (t): The 3D position of the Cα atom
  • Rotation (R): A 3×3 rotation matrix defining the local orientation

The backbone is thus represented as a sequence of frames: T₁, T₂, …, T_L

Why frames? They naturally encode:

  • Position (where is this residue?)
  • Orientation (how is the peptide bond oriented?)
  • Relative geometry (how do two residues relate in space?)

Invariant Point Attention (IPA): The Core Innovation

Invariant Point Attention is the structure module’s key mechanism. It’s a modified attention operation that:

  1. Is invariant to global rotations and translations
  2. Can reason about 3D geometric relationships
  3. Uses both the pair representation and current 3D coordinates

flowchart TB
    subgraph "IPA Inputs"
        Single[Single Representation<br/>per-residue features]
        Pair[Pair Representation<br/>from Evoformer]
        Frames[Current Frames<br/>T_i for each residue]
    end

    subgraph "IPA Computation"
        QKV[Generate Q, K, V<br/>in local frames]
        Points[Generate query/key points<br/>in local coordinates]
        Transform[Transform points to<br/>global frame]
        Dist[Compute point-to-point<br/>distances]
        Attn[Attention with<br/>pair bias + distance bias]
    end

    Single --> QKV
    Frames --> Points
    Points --> Transform
    Transform --> Dist
    Pair --> Attn
    Dist --> Attn
    QKV --> Attn

    Attn --> Output[Updated Single<br/>Representation]

The “invariant points” trick:

  1. Each residue generates “query points” and “key points” in its local coordinate frame
  2. These points are transformed to the global frame using the current frame estimate
  3. Distances between points are computed (distances are invariant to rotation/translation!)
  4. These distances bias the attention weights

Why this works: By working with distances between transformed points, IPA can reason about 3D geometry while remaining invariant to the choice of global reference frame.
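
The NumPy/SciPy sketch below demonstrates the invariance claim: points generated in local frames are mapped to global coordinates, and the pairwise distances that bias IPA's attention are unchanged when one arbitrary global rotation and translation is applied to every frame. Frame and point counts are illustrative.

import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
L = 5                                    # residues

# Per-residue frames: rotation R_i and translation t_i (the Cα position)
R = Rotation.random(L).as_matrix()       # (L, 3, 3)
t = rng.normal(size=(L, 3))
local_points = rng.normal(size=(L, 3))   # one "query point" per residue

def global_distances(R, t, p_local):
    """Transform local points to the global frame and take pairwise distances."""
    p_global = np.einsum("lij,lj->li", R, p_local) + t
    diff = p_global[:, None, :] - p_global[None, :, :]
    return np.linalg.norm(diff, axis=-1)

d0 = global_distances(R, t, local_points)

# Apply one global rotation Rg and translation tg to all frames: T_i' = Tg ∘ T_i
Rg = Rotation.random().as_matrix()
tg = np.array([10.0, -5.0, 3.0])
d1 = global_distances(Rg @ R, t @ Rg.T + tg, local_points)

print(np.allclose(d0, d1))  # True: the distances IPA uses are SE(3)-invariant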

The Frame Update: Predicting Backbone Geometry

After each IPA layer, the structure module updates the residue frames:

For each residue i:
    quaternion, translation = MLP(single_representation[i])
    T_i = T_i ∘ (quaternion_to_rotation(quaternion), translation)

Key insight: Updates are applied as compositions with the current frame, not absolute predictions. This makes learning easier—the network only needs to predict small refinements.

The frames start from an initial guess (often just the identity—all residues at the origin with no rotation) and are iteratively refined.
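
A minimal sketch of the compositional frame update, assuming quaternions in (w, x, y, z) order; the MLP is stubbed with small random numbers, since the point is the composition rule rather than the network.

import numpy as np

def quaternion_to_rotation(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def compose(frame, update):
    """(R1, t1) ∘ (R2, t2): apply the update in the current frame's coordinates."""
    (R1, t1), (R2, t2) = frame, update
    return R1 @ R2, R1 @ t2 + t1

frame = (np.eye(3), np.zeros(3))   # identity initialization, as described above

rng = np.random.default_rng(0)
for _ in range(8):                 # one update per structure-module layer
    # Stand-in for MLP(single_representation[i]): a small rotation + translation
    q = np.array([1.0, *(0.05 * rng.normal(size=3))])
    t = 0.5 * rng.normal(size=3)
    frame = compose(frame, (quaternion_to_rotation(q), t))

print(frame[0])  # accumulated rotation
print(frame[1])  # accumulated translation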

Side Chain Prediction: Torsion Angles

The structure module also predicts side chain conformations using torsion (dihedral) angles:

backbone torsion angles: φ, ψ, ω
side chain torsion angles: χ₁, χ₂, χ₃, χ₄ (depending on amino acid)

Each angle is predicted as:

(sin(angle), cos(angle)) = MLP(single_representation)
angle = atan2(sin, cos)

Why sin/cos instead of the angle directly? Angles wrap around (0° = 360°), which creates discontinuities. Predicting (sin, cos) avoids this problem.
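
A tiny sketch of the trick: the network emits an unnormalized 2-vector per angle, which is projected onto the unit circle and decoded with atan2. The raw values below are random stand-ins for network output.

import numpy as np

rng = np.random.default_rng(0)

raw = rng.normal(size=(4, 2))   # stand-in output: 4 chi angles x (sin, cos)
sin_cos = raw / np.linalg.norm(raw, axis=-1, keepdims=True)  # unit circle
angles = np.degrees(np.arctan2(sin_cos[:, 0], sin_cos[:, 1]))

print(angles)
# No wraparound problem: 359° and 1° map to nearby (sin, cos) points, whereas
# a direct angle regression would treat them as maximally distant targets.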

From Frames to Atoms

Once we have frames and torsion angles, computing atom positions is deterministic:

For each residue:
    1. Place backbone atoms (N, Cα, C, O) using the frame
    2. Place Cβ using ideal bond geometry
    3. Rotate each side chain bond by predicted χ angles
    4. All atom positions fall out of the geometry

This uses idealized bond lengths and angles from chemistry—the network only predicts the “free” degrees of freedom (frame rotations/translations and torsion angles).
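
The sketch below places the backbone heavy atoms of one residue from its frame, using approximate ideal local coordinates (rounded illustrative values with Cα at the origin, not the exact constants from the AlphaFold2 source).

import numpy as np

# Approximate ideal backbone positions in the residue's local frame (Å)
LOCAL = {
    "N":  np.array([-0.525, 1.363, 0.000]),
    "CA": np.array([ 0.000, 0.000, 0.000]),
    "C":  np.array([ 1.526, 0.000, 0.000]),
}

def place_backbone(R, t):
    """Map ideal local coordinates through the frame (R, t) to global positions."""
    return {name: R @ p + t for name, p in LOCAL.items()}

# Example frame: rotate 90° about z and put Cα at (10, 0, 0)
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([10.0, 0.0, 0.0])

for name, xyz in place_backbone(R, t).items():
    print(name, np.round(xyz, 3))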

Structure Module Architecture Summary

flowchart TB
    EvoOut[Evoformer Output<br/>Single + Pair repr.] --> Init[Initialize Frames<br/>Identity transforms]

    subgraph "Structure Module (8 layers)"
        Init --> IPA1[IPA Layer 1]
        IPA1 --> Update1[Frame Update]
        Update1 --> IPA2[IPA Layer 2]
        IPA2 --> Update2[Frame Update]
        Update2 --> Dots[...]
        Dots --> IPA8[IPA Layer 8]
        IPA8 --> Update8[Frame Update]
    end

    Update8 --> Torsion[Predict Torsion Angles]
    Update8 --> Final[Final Frames]

    Final --> Atoms[Compute Atom<br/>Positions]
    Torsion --> Atoms

    style IPA1 fill:#bbf
    style IPA8 fill:#bbf

Recycling: Iterative Refinement

AlphaFold2 doesn’t just run once—it recycles its predictions back as input for multiple rounds of refinement.

flowchart LR
    subgraph "Recycle 1"
        E1[Evoformer] --> S1[Structure Module]
    end

    subgraph "Recycle 2"
        E2[Evoformer] --> S2[Structure Module]
    end

    subgraph "Recycle 3"
        E3[Evoformer] --> S3[Structure Module]
    end

    S1 -->|"Pair repr + structure"| E2
    S2 -->|"Pair repr + structure"| E3
    S3 --> Final[Final Prediction]

What gets recycled?

  1. Pair representation from previous iteration
  2. Predicted structure (frames) from structure module
  3. First row of MSA representation (query sequence features)

Why recycle?

  • The Evoformer can use structural information to better interpret the MSA
  • Iterative refinement allows progressive improvement
  • Similar to how Rosetta uses “iterative assembly” in fragment-based prediction

Typical recycling: 3 rounds (default), up to 20 for difficult targets

Note: Recycling Creates an Implicit “Dynamics”

Each recycle can be thought of as a step of optimization. The structure progressively “folds” from its initial state toward the predicted final structure. This is somewhat analogous to molecular dynamics or energy minimization.
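
Here is a minimal control-flow sketch of recycling with early stopping. The network is stubbed out as a contraction toward a fixed target so the demo converges; the 0.5 Å tolerance mirrors the recycle_early_stop_tolerance parameter described in the next section, and everything else is illustrative.

import numpy as np

rng = np.random.default_rng(0)
L = 50
target = rng.normal(scale=10.0, size=(L, 3))   # stand-in "true" structure

def run_network(coords):
    """Stub for Evoformer + structure module: one pass refines the coordinates."""
    return coords + 0.6 * (target - coords)

coords = np.zeros((L, 3))   # initial state: every residue at the origin
tolerance = 0.5             # Å, cf. recycle_early_stop_tolerance
max_recycles = 20

for cycle in range(max_recycles):
    new_coords = run_network(coords)
    movement = np.sqrt(np.mean(np.sum((new_coords - coords) ** 2, axis=-1)))
    coords = new_coords
    print(f"recycle {cycle}: mean Cα movement = {movement:.2f} Å")
    if movement < tolerance:   # structure has converged; stop early
        break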

Check Your Understanding

Why does Invariant Point Attention (IPA) use distances between transformed points?

  • To make computation faster
  • Distances are invariant to rotation and translation
  • Points are easier to visualize than frames
  • To reduce memory usage

Key Parameters and Settings

Recycling Parameters

| Parameter | What it does | Typical values |
|-----------|--------------|----------------|
| num_recycles | How many refinement passes | 3-20 (more = better but slower) |
| max_recycles | Hard cap on recycling | 20 |
| recycle_early_stop_tolerance | Stop early if the structure converges | 0.5 Å |

When to increase recycles:

  • Large proteins (>500 residues)
  • Multi-domain proteins
  • When initial predictions look uncertain

MSA Settings

| Parameter | What it does | Options |
|-----------|--------------|---------|
| msa_mode | How to generate the MSA | mmseqs2 (fast), jackhmmer (thorough) |
| pair_mode | For multimers: how to pair sequences | paired, unpaired, paired+unpaired |
| use_msa | Whether to use an MSA at all | true/false |
| use_templates | Whether to search for structural templates | true/false |

Warning: Single-Sequence Mode

You can run AlphaFold2 without an MSA (use_msa=false), but accuracy drops significantly. Use this only for:

  • Designed proteins with no natural homologs
  • Quick preliminary scans
  • When comparing to ESMFold

Model Selection

| Parameter | What it does | Options |
|-----------|--------------|---------|
| model_type | Monomer vs multimer | monomer, monomer_ptm, multimer |
| num_models | Number of model versions to run | 1-5 |
| rank_by | How to select the best prediction | plddt, ptm, iptm+ptm |

The 5 models: AlphaFold2 was trained with 5 different random initializations. Running all 5 provides:

  • Ensemble diversity (different predictions)
  • Uncertainty estimation (do they agree?)
  • A better chance of finding the best structure

Relaxation

| Parameter | What it does |
|-----------|--------------|
| use_amber | Run Amber energy minimization |
| use_gpu_relax | Use the GPU for relaxation (faster) |

Why relax?

  • Fixes minor clashes and bad geometry
  • Makes structures more physically realistic
  • Important for downstream applications (docking, MD)


AlphaFold2 Extensions

AlphaFold-Multimer

Predicting protein complexes (multiple chains interacting):

Key modifications:

  • Cross-chain MSA pairing for evolutionary signal
  • Losses that account for chain permutation symmetry
  • Interface-aware training

# In your FASTA file, separate chains with ":"
>complex
SEQUENCEOFCHAINA:SEQUENCEOFCHAINB

AlphaFold Database

DeepMind released predictions for 200+ million proteins—essentially all of UniProt.

  • Access: alphafold.ebi.ac.uk
  • Coverage: Human proteome + 47 other key organisms
  • Use case: Check if your protein has already been predicted!

AF-Cluster

Idea: Proteins can have multiple conformations. Can we bias AlphaFold2 toward different states?

Approach:

  • Cluster the MSA into subgroups
  • Run predictions with different MSA subsets
  • Different clusters may yield different conformational states

Reference: Wayment-Steele et al. (2022)

AFsample

Idea: Generate diverse predictions through sampling.

Approach:

  • Enable dropout during inference
  • Increase recycling
  • Generate multiple diverse predictions

This is useful for understanding conformational flexibility and uncertainty.

Automated Workflows

Tools like EvoPro and BindCraft use AlphaFold2 as part of larger design pipelines:

  • Generate candidate designs
  • Predict structures with AF2
  • Score and filter based on confidence
  • Iterate to improve designs

Understanding the Output

pLDDT: Per-Residue Confidence

pLDDT (predicted Local Distance Difference Test) ranges from 0-100:

| pLDDT | Interpretation | What to do |
|-------|----------------|------------|
| >90 | Very high confidence | Trust the local structure |
| 70-90 | Confident | Generally reliable |
| 50-70 | Low confidence | Treat with caution |
| <50 | Very low confidence | Likely disordered or wrong |

Tip: Low pLDDT ≠ Wrong

Low pLDDT regions often correspond to:

  • Intrinsically disordered regions (genuinely unstructured)
  • Flexible loops (multiple conformations possible)
  • Crystal contacts (structure depends on environment)

Low confidence is information, not failure!

pTM and ipTM: Global Confidence

  • pTM (predicted TM-score): Overall structure confidence (0-1)
  • ipTM (interface pTM): Confidence in the interface prediction (for multimers)

| Score | Interpretation |
|-------|----------------|
| >0.8 | High confidence |
| 0.5-0.8 | Moderate confidence |
| <0.5 | Low confidence |

PAE: Predicted Aligned Error

The PAE matrix shows confidence in relative positions between residue pairs.

Reading PAE plots:

  • Blue/low values: Confident in relative position
  • Red/high values: Uncertain about relative position
  • Diagonal blocks: Domains (confident within, uncertain between)

PAE is crucial for:

  • Identifying domain boundaries
  • Assessing multimer interface confidence
  • Understanding which parts of the structure are well-determined relative to each other
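
To tie these metrics together, here is a short sketch that loads a ColabFold scores JSON and summarizes pLDDT, pTM, and PAE. The key names ("plddt", "pae", "ptm") match recent ColabFold output, but versions differ, so inspect your own files; the path is an example from the exercise below.

import json
import numpy as np

# Adjust the path to match your own ColabFold output
path = "gfp_output/gfp_scores_rank_001_alphafold2_ptm_model_1_seed_000.json"
with open(path) as fh:
    scores = json.load(fh)

plddt = np.array(scores["plddt"])   # per-residue confidence, 0-100
pae = np.array(scores["pae"])       # L x L predicted aligned error (Å)

print(f"Mean pLDDT: {plddt.mean():.1f}")
print(f"pTM:        {scores['ptm']:.2f}")
print(f"Residues with pLDDT < 70: {int((plddt < 70).sum())}")
print(f"Max PAE:    {pae.max():.1f} Å")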


Practical Considerations

When to Use AlphaFold2

Best for:

  • Proteins with many homologs in databases
  • Monomers and stable complexes
  • When accuracy is paramount

Less suitable for:

  • Designed/synthetic proteins (few homologs)
  • Highly dynamic proteins
  • When speed is critical (use ESMFold instead)

Resource Requirements

| Resource | Minimum | Recommended |
|----------|---------|-------------|
| GPU RAM | 16 GB | 40+ GB (A100) |
| CPU RAM | 32 GB | 64 GB |
| Disk | 15 GB | 100+ GB (with databases) |

Tips for Better Predictions

  1. Check the AlphaFold Database first—your protein may already be predicted
  2. Use all 5 models for important predictions
  3. Examine pLDDT and PAE—don’t just look at the structure
  4. Consider multiple conformations for flexible proteins
  5. Validate experimentally for high-stakes applications

Hands-On Exercise

Part 1: Run ColabFold Prediction

Goal: Generate your first AlphaFold2 prediction.

We’ll predict the structure of Green Fluorescent Protein (GFP) using ColabFold.

1. Prepare your sequence:

Create a file called gfp.fasta:

>GFP
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK

You can also download this file: 1GFL.fasta

2. Run prediction:

If using LocalColabFold on your HPC:

colabfold_batch gfp.fasta gfp_output/

If using the ColabFold notebook:

  • Go to ColabFold
  • Paste your sequence
  • Run all cells

3. Expected output:

  • gfp_relaxed_rank_001_*.pdb - best predicted structure
  • gfp_scores_rank_001_*.json - confidence scores
  • gfp_coverage.png - MSA coverage
  • gfp_pae.png - PAE heatmap

Expected runtime: 5-15 minutes depending on your hardware.

Part 2: Analyze Your Prediction

1. Load the structure in PyMOL:

load gfp_output/gfp_relaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb, af2_gfp

2. Color by confidence (pLDDT):

spectrum b, blue_white_red, minimum=50, maximum=100

3. Questions to answer:

  • What is the overall pLDDT? (Check the JSON file or PyMOL’s B-factor range)
  • Which regions have high confidence? Low confidence?
  • Does GFP have any disordered regions?

4. Compare to experimental structure:

fetch 1GFL
align af2_gfp, 1GFL

  • What is the RMSD between prediction and experiment?
  • Do the structures overlay well?

Part 3: Explore the PAE

1. Open gfp_pae.png

2. Interpret the plot:

  • Is there one block (single domain) or multiple blocks?
  • Are there any high-PAE (red) regions?
  • What does this tell you about the structure?

Part 4: Experiment with Parameters

Try running predictions with different settings:

Experiment 1: Fewer recycles

colabfold_batch --num-recycle 1 gfp.fasta gfp_1recycle/

Compare to the default (3 recycles). Is there a difference in quality?

Experiment 2: Single sequence (no MSA)

colabfold_batch --msa-mode single_sequence gfp.fasta gfp_single_seq/

This mimics what ESMFold does. How does quality compare?

Experiment 3: All 5 models

colabfold_batch --num-models 5 gfp.fasta gfp_all_models/

Do the different models agree? What’s the variance in pLDDT?

Part 5: Multimer Prediction (Optional)

If you have time, try predicting a protein complex:

1. Create a multimer FASTA:

>homodimer
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK:MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK

(Note: The : separates chains)

2. Run prediction:

colabfold_batch homodimer.fasta homodimer_output/

3. Analyze:

  • What is the ipTM score?
  • Does the interface look reasonable?
  • Check the PAE for interface confidence (off-diagonal blocks)

Questions to Consider

  1. How does prediction time scale with sequence length?
  2. Why might some regions have low pLDDT even in a well-folded protein?
  3. When would you trust a prediction enough to use it for experimental planning?
  4. What would you do if AlphaFold2 and ESMFold give different predictions?

Record Your Results

Fill in this table with your observations:

| Metric | Your GFP Prediction |
|--------|---------------------|
| Average pLDDT | |
| pTM score | |
| RMSD to 1GFL | |
| Prediction time | |
| MSA depth | |
| Regions with pLDDT < 70 | |