12. ESM3 (Optional)
ESM3 (paper, code) is a frontier generative model for biology. It is a multimodal generative masked language model that jointly reasons across three fundamental biological properties of proteins: sequence, structure, and function.
Note: This tool is marked as OPTIONAL. Install if you’re interested in protein generation and multimodal design beyond structure prediction.
Why Use ESM3?
- Multimodal generation: Jointly reason about sequence, structure, and function
- Protein generation: Create novel proteins with desired properties
- Sequence completion: Fill in masked or missing regions
- Embeddings: Extract rich protein representations (ESM C)
Related Tools: For structure prediction only, see ESMFold. For sequence design given structure, see LigandMPNN.
Resource Requirements
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU RAM | 16 GB | 24+ GB | For esm3-small (1.4B params) |
| CPU RAM | 16 GB | 32 GB | For preprocessing |
| Disk Space | 10 GB | 20 GB | Model weights |
| Python | 3.10+ | 3.10 | Required |
Model sizes:
- `esm3-small-2024-08` (1.4B params): runs locally
- `esm3-medium-2024-08` (7B params): via Forge API
- `esm3-large-2024-03` (98B params): via Forge API
Preparation
Prerequisites:
- Completed HPC Setup guide
- Conda/Mamba installed
- HuggingFace account (for model access)
Installation
- Create a conda environment:
```bash
mamba create -n esm3 python=3.10
mamba activate esm3
```

- Install the ESM library:

```bash
pip install esm
```

HuggingFace Authentication
ESM3 weights are stored on HuggingFace Hub. You need to authenticate:
- Create a HuggingFace account at huggingface.co
- Generate an API token with “Read” permission at huggingface.co/settings/tokens
- Authenticate in Python:
```python
from huggingface_hub import login
login()  # Follow prompts to enter your token
```

Or set an environment variable:

```bash
export HF_TOKEN="your_token_here"
```

Testing the Installation
Create a test script test_esm3.py:
```python
import os
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

# Authenticate
# Method 1: Environment variable (recommended for HPC jobs)
if "HF_TOKEN" in os.environ:
    login(token=os.environ["HF_TOKEN"])
# Method 2: Interactive login (run once on a login node)
else:
    login()

# Load the model (downloads weights on first run)
model = ESM3.from_pretrained("esm3-small-2024-08").to("cuda")  # or "cpu"

# Generate a protein sequence completion
prompt = "MKTVRQ_______________QLAEELSVSRQVIVQDIAYLRSLG"
protein = ESMProtein(sequence=prompt)

# Generate sequence
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=8, temperature=0.7)
)
print("Generated sequence:")
print(protein.sequence)

# Generate structure
protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=8)
)

# Save structure
protein.to_pdb("./generated.pdb")
print("Structure saved to generated.pdb")
```

Run the test:

```bash
python test_esm3.py
```

Success indicators:
- Model loads without errors
- Sequence completion fills in the masked region
- Structure is generated and saved as PDB
Expected runtime: 2-5 minutes (first run downloads ~3GB weights).
HPC Job Script
```bash
#!/bin/bash
#SBATCH --job-name=esm3
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.1
# source ~/.bashrc
mamba activate esm3

# Set HuggingFace token
export HF_TOKEN="your_token_here"

python generate_protein.py
```

Usage Examples
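The job script runs a `generate_protein.py` that this guide does not otherwise define. A minimal sketch, mirroring the `test_esm3.py` API shown earlier; the prompt and output filename are placeholders to replace with your own, and the imports are deferred into `main()` so the file can be inspected on a machine without `esm` installed:

```python
# generate_protein.py -- minimal batch-job sketch mirroring test_esm3.py
import os

PROMPT = "MKTVRQ_______________QLAEELSVSRQVIVQDIAYLRSLG"  # underscores = masked
OUTPUT_PDB = "generated.pdb"

def main():
    # Deferred imports: the heavy esm dependency is only needed when the job runs
    from huggingface_hub import login
    from esm.models.esm3 import ESM3
    from esm.sdk.api import ESMProtein, GenerationConfig

    login(token=os.environ["HF_TOKEN"])  # exported by the SBATCH script
    model = ESM3.from_pretrained("esm3-small-2024-08").to("cuda")

    # Fill in the masked region, then fold the completed sequence
    protein = ESMProtein(sequence=PROMPT)
    protein = model.generate(
        protein,
        GenerationConfig(track="sequence", num_steps=8, temperature=0.7),
    )
    protein = model.generate(
        protein,
        GenerationConfig(track="structure", num_steps=8),
    )
    protein.to_pdb(OUTPUT_PDB)
    print(f"Wrote {OUTPUT_PDB}")

if __name__ == "__main__":
    main()
```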
Sequence generation (fill masked regions):
```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

model = ESM3.from_pretrained("esm3-small-2024-08").to("cuda")

# Use underscores for masked positions
protein = ESMProtein(sequence="MKTVRQ_______________QLAEELSVSRQVIVQDIAYLRSLG")

# Generate
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=8, temperature=0.7)
)
print(protein.sequence)
```

Structure prediction:
```python
protein = ESMProtein(sequence="MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLG")
protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=8)
)
protein.to_pdb("predicted.pdb")
```

Using ESM C for embeddings only (faster, smaller):
```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

protein = ESMProtein(sequence="MKTVRQERLK")
client = ESMC.from_pretrained("esmc_300m").to("cuda")

# Get embeddings
protein_tensor = client.encode(protein)
logits_output = client.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True)
)
print(f"Embedding shape: {logits_output.embeddings.shape}")
```

Available Models
| Model | Parameters | Availability |
|---|---|---|
| `esm3-small-2024-08` | 1.4B | Local (free) |
| `esmc_300m` | 300M | Local (fast embeddings) |
| `esmc_600m` | 600M | Local |
| `esm3-medium-2024-08` | 7B | Forge API |
| `esm3-large-2024-03` | 98B | Forge API |
Generation Tracks
ESM3 can generate different “tracks”:
| Track | Description |
|---|---|
| `sequence` | Generate amino acid sequence |
| `structure` | Generate 3D coordinates |
| `function` | Generate functional annotations |
Key Parameters
| Parameter | Description |
|---|---|
| `track` | What to generate: `sequence`, `structure`, or `function` |
| `num_steps` | Number of generation steps (more = better quality) |
| `temperature` | Sampling diversity (higher = more diverse) |
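Temperature here behaves as in other samplers: logits are divided by the temperature before the softmax, so higher values flatten the distribution and increase diversity. A minimal illustration of that effect, independent of ESM3, using NumPy:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)  # sharper: concentrates on the top logit
hot = softmax_with_temperature(logits, 2.0)   # flatter: more diverse samples
print(cold.round(3), hot.round(3))
```

With the low temperature the top token dominates; with the high temperature the probabilities are much closer to uniform.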
Use Cases
- Protein generation: Create novel proteins
- Sequence completion: Fill in missing regions
- Structure prediction: Generate 3D structures
- Function prediction: Predict functional properties
- Embeddings: Extract protein representations for ML
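For the embeddings use case, the per-residue embeddings returned by ESM C (shape `[1, length, dim]` in the example above) are commonly mean-pooled into one fixed-length vector per protein before training a downstream classifier. A sketch with a dummy array standing in for `logits_output.embeddings`; the 960 dimension is an assumption about the ESM C 300M embedding width, so check it against your actual tensor:

```python
import numpy as np

# Dummy stand-in for logits_output.embeddings: batch of 1, 10 residues, 960 dims
embeddings = np.random.rand(1, 10, 960)

# Mean-pool over the residue axis to get one fixed-length vector per protein
protein_vector = embeddings.mean(axis=1).squeeze(0)
print(protein_vector.shape)  # (960,)
```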
Troubleshooting
HuggingFace authentication errors:
- Verify token has "Read" permission
- Run `login()` in Python and follow prompts
- Or set the `HF_TOKEN` environment variable
Model download issues:
- Check network connectivity
- Weights are large (~3 GB for the small model)
- Set `HF_HOME` to a location with space: `export HF_HOME=/scratch/$USER/huggingface`
GPU memory issues:
- Use CPU if GPU is insufficient: `.to("cpu")`
- Reduce batch size if processing multiple proteins
- `esmc_300m` is smaller and faster for embeddings
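Rather than hard-coding `.to("cuda")`, you can fall back to CPU automatically. A small sketch, assuming PyTorch is available (the `esm` package depends on it):

```python
import torch

# Pick the GPU when one is visible, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Then load any of the models onto it, e.g.:
# model = ESM3.from_pretrained("esm3-small-2024-08").to(device)
```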
Slow generation:
- GPU strongly recommended
- Reduce `num_steps` for faster (lower-quality) results
- Use ESM C for embeddings (no structure generation)