12. ESM3 (Optional)

ESM3 (paper, code) is a frontier generative model for biology that jointly reasons over three fundamental properties of proteins: sequence, structure, and function. Architecturally, it is a multimodal generative masked language model.

Note: This tool is marked as OPTIONAL. Install if you’re interested in protein generation and multimodal design beyond structure prediction.

Why Use ESM3?

  • Multimodal generation: Jointly reason about sequence, structure, and function
  • Protein generation: Create novel proteins with desired properties
  • Sequence completion: Fill in masked or missing regions
  • Embeddings: Extract rich protein representations (ESM C)

Related Tools: For structure prediction only, see ESMFold. For sequence design given structure, see LigandMPNN.

Resource Requirements

Resource     Minimum   Recommended   Notes
GPU RAM      16 GB     24+ GB        For esm3-small (1.4B params)
CPU RAM      16 GB     32 GB         For preprocessing
Disk Space   10 GB     20 GB         Model weights
Python       3.10+     3.10          Required

Model sizes:

  • esm3-small-2024-08 (1.4B params): Runs locally
  • esm3-medium-2024-08 (7B params): Via Forge API
  • esm3-large-2024-03 (98B params): Via Forge API
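ESM3 prompts mark positions to be generated with "_" (one underscore per masked residue), as the examples later in this guide show. Building such prompts can be done with a small helper — a sketch in plain Python; the function name is illustrative, not part of the esm package:

```python
def masked_prompt(prefix: str, suffix: str, n_masked: int) -> str:
    """Build an ESM3-style prompt: known flanking residues around
    n_masked positions (one '_' per residue) for the model to fill in."""
    return prefix + "_" * n_masked + suffix

# 15 masked positions between two known flanking segments
prompt = masked_prompt("MKTVRQ", "QLAEELSVSRQVIVQDIAYLRSLG", 15)
print(prompt)
print(prompt.count("_"))  # number of positions the model will generate
```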

Preparation

Prerequisites:

  • Completed HPC Setup guide
  • Conda/Mamba installed
  • HuggingFace account (for model access)

Installation

  1. Create a conda environment:
mamba create -n esm3 python=3.10
mamba activate esm3
  2. Install the ESM library:
pip install esm

HuggingFace Authentication

ESM3 weights are stored on HuggingFace Hub. You need to authenticate:

  1. Create a HuggingFace account at huggingface.co
  2. Generate an API token with “Read” permission at huggingface.co/settings/tokens
  3. Authenticate in Python:
from huggingface_hub import login
login()  # Follow prompts to enter your token

Or set environment variable:

export HF_TOKEN="your_token_here"
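For batch jobs, it is worth failing fast if the token is missing, rather than waiting in the queue only to hit an authentication error. A minimal sketch — the check itself is generic shell, not part of the esm tooling:

```shell
# Fail fast if HF_TOKEN is missing, before wasting queued GPU time
# on a model download that will be rejected.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "Error: HF_TOKEN is not set; model download will fail." >&2
        return 1
    fi
    echo "HF_TOKEN is set"
}

export HF_TOKEN="your_token_here"
require_hf_token
```

Call `require_hf_token` near the top of a job script, before launching Python.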

Testing the Installation

Create a test script test_esm3.py:

import os
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

# Authenticate
# Method 1: Environment variable (Recommended for HPC jobs)
if "HF_TOKEN" in os.environ:
    login(token=os.environ["HF_TOKEN"])
# Method 2: Interactive login (Run once on login node)
else:
    login()

# Load the model (downloads weights on first run)
model = ESM3.from_pretrained("esm3-small-2024-08").to("cuda")  # or "cpu"

# Generate a protein sequence completion
prompt = "MKTVRQ_______________QLAEELSVSRQVIVQDIAYLRSLG"
protein = ESMProtein(sequence=prompt)

# Generate sequence
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=8, temperature=0.7)
)

print("Generated sequence:")
print(protein.sequence)

# Generate structure
protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=8)
)

# Save structure
protein.to_pdb("./generated.pdb")
print("Structure saved to generated.pdb")

Run the test:

python test_esm3.py

Success indicators:

  • Model loads without errors
  • Sequence completion fills in the masked region
  • Structure is generated and saved as PDB

Expected runtime: 2-5 minutes (first run downloads ~3GB weights).

HPC Job Script

#!/bin/bash
#SBATCH --job-name=esm3
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.1

# source ~/.bashrc
mamba activate esm3

# Set HuggingFace token
export HF_TOKEN="your_token_here"

python generate_protein.py

Usage Examples

Sequence generation (fill masked regions):

from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

model = ESM3.from_pretrained("esm3-small-2024-08").to("cuda")

# Use underscores for masked positions
protein = ESMProtein(sequence="MKTVRQ_______________QLAEELSVSRQVIVQDIAYLRSLG")

# Generate
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=8, temperature=0.7)
)
print(protein.sequence)

Structure prediction:

protein = ESMProtein(sequence="MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLG")

protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=8)
)

protein.to_pdb("predicted.pdb")

Using ESM C for embeddings only (faster, smaller):

from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

protein = ESMProtein(sequence="MKTVRQERLK")
client = ESMC.from_pretrained("esmc_300m").to("cuda")

# Get embeddings
protein_tensor = client.encode(protein)
logits_output = client.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True)
)

print(f"Embedding shape: {logits_output.embeddings.shape}")
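`logits_output.embeddings` above is a per-residue tensor; for downstream ML you often want one fixed-length vector per protein. A common choice is mean pooling over residues — sketched here with plain Python lists so the idea is visible without the esm or torch dependencies:

```python
def mean_pool(per_residue):
    """Average per-residue embedding vectors (a list of equal-length
    lists) into a single fixed-length protein-level vector."""
    n = len(per_residue)
    dim = len(per_residue[0])
    return [sum(vec[d] for vec in per_residue) / n for d in range(dim)]

# Toy "embeddings" for a 3-residue protein with a 4-dim embedding
emb = [[1.0, 2.0, 0.0, 4.0],
       [3.0, 2.0, 0.0, 0.0],
       [2.0, 2.0, 0.0, 2.0]]
print(mean_pool(emb))  # [2.0, 2.0, 0.0, 2.0]
```

With real torch embeddings, the equivalent is roughly a `.mean(...)` over the sequence dimension.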

Available Models

Model                 Parameters   Availability
esm3-small-2024-08    1.4B         Local (free)
esmc_300m             300M         Local (fast embeddings)
esmc_600m             600M         Local
esm3-medium-2024-08   7B           Forge API
esm3-large-2024-03    98B          Forge API

Generation Tracks

ESM3 can generate different “tracks”:

Track       Description
sequence    Generate amino acid sequence
structure   Generate 3D coordinates
function    Generate functional annotations

Key Parameters

Parameter     Description
track         What to generate: sequence, structure, or function
num_steps     Number of generation steps (more = better quality)
temperature   Sampling diversity (higher = more diverse)
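The effect of temperature is generic to sampling from logits: logits are divided by the temperature before the softmax, so higher values flatten the distribution (more diverse samples) and lower values sharpen it toward the top choice. A self-contained sketch of that mechanism, independent of the esm package's internals:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaling by 1/temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))  # sharp: top option dominates
print(softmax_with_temperature(logits, 2.0))  # flat: options closer together
```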

Use Cases

  • Protein generation: Create novel proteins
  • Sequence completion: Fill in missing regions
  • Structure prediction: Generate 3D structures
  • Function prediction: Predict functional properties
  • Embeddings: Extract protein representations for ML

Troubleshooting

Warning: Common Issues

HuggingFace authentication errors:

  • Verify token has “Read” permission
  • Run login() in Python and follow prompts
  • Or set HF_TOKEN environment variable

Model download issues:

  • Check network connectivity

  • Weights are large (~3GB for small model)

  • Set HF_HOME to location with space:

    export HF_HOME=/scratch/$USER/huggingface
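Relocating the cache is just a matter of exporting HF_HOME before the first model download, e.g. near the top of your job script. A sketch — /scratch/$USER is an example path; use whatever large filesystem your cluster provides:

```shell
# Point the HuggingFace cache at a filesystem with room for the weights
# (~3 GB for esm3-small). huggingface_hub creates the directory tree on
# first download, so exporting the variable is all that is required.
export HF_HOME="/scratch/$USER/huggingface"
echo "HuggingFace cache directory: $HF_HOME"
```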

GPU memory issues:

  • Use CPU if GPU is insufficient: .to("cpu")
  • Reduce batch size if processing multiple proteins
  • esmc_300m is smaller and faster for embeddings

Slow generation:

  • GPU strongly recommended
  • Reduce num_steps for faster (lower quality) results
  • Use ESM C for embeddings (no structure generation)