12. ESM3 (Optional)

ESM3 (paper, code) is a frontier generative model for biology that jointly reasons over three fundamental properties of proteins: sequence, structure, and function. Architecturally, it is a multimodal generative masked language model.

Note: This tool is marked as OPTIONAL. Install if you’re interested in protein generation and multimodal design beyond structure prediction.

Why Use ESM3?

  • Multimodal generation: Jointly reason about sequence, structure, and function
  • Protein generation: Create novel proteins with desired properties
  • Sequence completion: Fill in masked or missing regions
  • Embeddings: Extract rich protein representations (ESM C)

Related Tools: For structure prediction only, see ESMFold. For sequence design given structure, see LigandMPNN.

Resource Requirements

Resource     Minimum   Recommended   Notes
GPU RAM      16 GB     24+ GB        For esm3-small (1.4B params)
CPU RAM      16 GB     32 GB         For preprocessing
Disk Space   10 GB     20 GB         Model weights
Python       3.10+     3.10          Required

Model sizes:

  • esm3-small-2024-08 (1.4B params): Runs locally
  • esm3-medium-2024-08 (7B params): Via Forge API
  • esm3-large-2024-03 (98B params): Via Forge API
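ESM3 prompts mark positions to be generated with "_" (one underscore per masked residue), as the examples later in this guide show. Building such prompts can be done with a small helper — a sketch in plain Python; the function name is illustrative, not part of the esm package:

```python
def masked_prompt(prefix: str, suffix: str, n_masked: int) -> str:
    """Build an ESM3-style prompt: known flanking residues around
    n_masked positions (one '_' per residue) for the model to fill in."""
    return prefix + "_" * n_masked + suffix

# 15 masked positions between two known flanking segments
prompt = masked_prompt("MKTVRQ", "QLAEELSVSRQVIVQDIAYLRSLG", 15)
print(prompt)
print(prompt.count("_"))  # number of positions the model will generate
```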

Preparation

Prerequisites:

  • Completed HPC Setup guide
  • Conda/Mamba installed
  • HuggingFace account (for model access)

Installation

  1. Create a conda environment:
mamba create -n esm3 python=3.10
mamba activate esm3
  2. Install the ESM library:
pip install esm

HuggingFace Authentication

ESM3 weights are stored on HuggingFace Hub. You need to authenticate:

  1. Create a HuggingFace account at huggingface.co
  2. Generate an API token with “Read” permission at huggingface.co/settings/tokens
  3. Authenticate in Python:
from huggingface_hub import login
login()  # Follow prompts to enter your token

Or set environment variable:

export HF_TOKEN="your_token_here"
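For batch jobs, it is worth failing fast if the token is missing, rather than waiting in the queue only to hit an authentication error. A minimal sketch — the check itself is generic shell, not part of the esm tooling:

```shell
# Fail fast if HF_TOKEN is missing, before wasting queued GPU time
# on a model download that will be rejected.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "Error: HF_TOKEN is not set; model download will fail." >&2
        return 1
    fi
    echo "HF_TOKEN is set"
}

export HF_TOKEN="your_token_here"
require_hf_token
```

Call `require_hf_token` near the top of a job script, before launching Python.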

Testing the Installation

Create a test script test_esm3.py:

import os
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

# Authenticate
# Method 1: Environment variable (Recommended for HPC jobs)
if "HF_TOKEN" in os.environ:
    login(token=os.environ["HF_TOKEN"])
# Method 2: Interactive login (Run once on login node)
else:
    login()

# Load the model (downloads weights on first run)
model = ESM3.from_pretrained("esm3-small-2024-08").to("cuda")  # or "cpu"

# Generate a protein sequence completion
prompt = "MKTVRQ_______________QLAEELSVSRQVIVQDIAYLRSLG"
protein = ESMProtein(sequence=prompt)

# Generate sequence
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=8, temperature=0.7)
)

print("Generated sequence:")
print(protein.sequence)

# Generate structure
protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=8)
)

# Save structure
protein.to_pdb("./generated.pdb")
print("Structure saved to generated.pdb")

Run the test:

python test_esm3.py

Success indicators:

  • Model loads without errors
  • Sequence completion fills in the masked region
  • Structure is generated and saved as PDB

Expected runtime: 2-5 minutes (first run downloads ~3GB weights).

HPC Job Script

#!/bin/bash
#SBATCH --job-name=esm3
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.1

# source ~/.bashrc
mamba activate esm3

# Set HuggingFace token
export HF_TOKEN="your_token_here"

python generate_protein.py

Usage Examples

Sequence generation (fill masked regions):

from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

model = ESM3.from_pretrained("esm3-small-2024-08").to("cuda")

# Use underscores for masked positions
protein = ESMProtein(sequence="MKTVRQ_______________QLAEELSVSRQVIVQDIAYLRSLG")

# Generate
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=8, temperature=0.7)
)
print(protein.sequence)

Structure prediction:

protein = ESMProtein(sequence="MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLG")

protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=8)
)

protein.to_pdb("predicted.pdb")

Using ESM C for embeddings only (faster, smaller):

from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

protein = ESMProtein(sequence="MKTVRQERLK")
client = ESMC.from_pretrained("esmc_300m").to("cuda")

# Get embeddings
protein_tensor = client.encode(protein)
logits_output = client.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True)
)

print(f"Embedding shape: {logits_output.embeddings.shape}")
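`logits_output.embeddings` above is a per-residue tensor; for downstream ML you often want one fixed-length vector per protein. A common choice is mean pooling over residues — sketched here with plain Python lists so the idea is visible without the esm or torch dependencies:

```python
def mean_pool(per_residue):
    """Average per-residue embedding vectors (a list of equal-length
    lists) into a single fixed-length protein-level vector."""
    n = len(per_residue)
    dim = len(per_residue[0])
    return [sum(vec[d] for vec in per_residue) / n for d in range(dim)]

# Toy "embeddings" for a 3-residue protein with a 4-dim embedding
emb = [[1.0, 2.0, 0.0, 4.0],
       [3.0, 2.0, 0.0, 0.0],
       [2.0, 2.0, 0.0, 2.0]]
print(mean_pool(emb))  # [2.0, 2.0, 0.0, 2.0]
```

With real torch embeddings, the equivalent is roughly a `.mean(...)` over the sequence dimension.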

Available Models

Model                 Parameters   Availability
esm3-small-2024-08    1.4B         Local (free)
esmc_300m             300M         Local (fast embeddings)
esmc_600m             600M         Local
esm3-medium-2024-08   7B           Forge API
esm3-large-2024-03    98B          Forge API

Generation Tracks

ESM3 can generate different “tracks”:

Track       Description
sequence    Generate amino acid sequence
structure   Generate 3D coordinates
function    Generate functional annotations

Key Parameters

Parameter     Description
track         What to generate: sequence, structure, or function
num_steps     Number of generation steps (more = better quality)
temperature   Sampling diversity (higher = more diverse)
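The effect of temperature is generic to sampling from logits: logits are divided by the temperature before the softmax, so higher values flatten the distribution (more diverse samples) and lower values sharpen it toward the top choice. A self-contained sketch of that mechanism, independent of the esm package's internals:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaling by 1/temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))  # sharp: top option dominates
print(softmax_with_temperature(logits, 2.0))  # flat: options closer together
```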

Use Cases

  • Protein generation: Create novel proteins
  • Sequence completion: Fill in missing regions
  • Structure prediction: Generate 3D structures
  • Function prediction: Predict functional properties
  • Embeddings: Extract protein representations for ML

Troubleshooting

Warning: Common Issues

HuggingFace authentication errors:

  • Verify token has “Read” permission
  • Run login() in Python and follow prompts
  • Or set HF_TOKEN environment variable

Model download issues:

  • Check network connectivity

  • Weights are large (~3GB for small model)

  • Set HF_HOME to location with space:

    export HF_HOME=/scratch/$USER/huggingface
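Relocating the cache is just a matter of exporting HF_HOME before the first model download, e.g. near the top of your job script. A sketch — /scratch/$USER is an example path; use whatever large filesystem your cluster provides:

```shell
# Point the HuggingFace cache at a filesystem with room for the weights
# (~3 GB for esm3-small). huggingface_hub creates the directory tree on
# first download, so exporting the variable is all that is required.
export HF_HOME="/scratch/$USER/huggingface"
echo "HuggingFace cache directory: $HF_HOME"
```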

GPU memory issues:

  • Use CPU if GPU is insufficient: .to("cpu")
  • Reduce batch size if processing multiple proteins
  • esmc_300m is smaller and faster for embeddings

Slow generation:

  • GPU strongly recommended
  • Reduce num_steps for faster (lower quality) results
  • Use ESM C for embeddings (no structure generation)