3. LigandMPNN

LigandMPNN (paper, code) is a deep learning model for context-aware protein sequence design. It extends ProteinMPNN to handle small molecules, metal ions, and other non-protein components in protein design tasks.

Live Workshop Session

🎥 Live workshop recording — Inverse folding using ProteinMPNN

Why Use LigandMPNN?

Ligand-aware design: Design sequences that account for bound cofactors, substrates, or drug molecules
Context preservation: Maintain interactions with metals, DNA, RNA, or other molecules
Side chain packing: Evaluate and optimize side chain conformations
Flexible residue control: Fix, bias, or vary specific positions

Related Tools: Use with RFdiffusion2 for backbone design, or BindCraft for complete binder design pipelines.

Resource Requirements

Resource	Minimum	Recommended	Notes
GPU RAM	4 GB	16 GB	Scales with protein size
CPU RAM	8 GB	16 GB	CPU-only is viable but slower
Disk Space	2 GB	5 GB	Model weights
Python	3.9+	3.11	Required

Preparation

Prerequisites:

Completed HPC Setup guide
Conda/Mamba installed
Git installed

Verify your environment:

python --version    # Should be 3.9+
nvcc --version      # For GPU support (optional)

Installation

Clone the LigandMPNN repository:

git clone https://github.com/dauparas/LigandMPNN.git
cd LigandMPNN

Download the model parameters. This step requires internet access; if your compute nodes have no internet connection, run it on a login node:

bash get_model_params.sh "./model_params"

Expected download: ~500 MB of model weights.

Create a new conda environment:

mamba create -n ligandmpnn_env python=3.11
mamba activate ligandmpnn_env

Install dependencies:

pip install -r requirements.txt

This installs PyTorch, NumPy, and ProDy for PDB file handling.
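
Before running a design, it can help to confirm that the environment actually sees these packages. The following is a minimal sketch, not part of LigandMPNN: `check_environment` is a hypothetical helper, and it assumes the import names `torch`, `numpy`, and `prody`.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "numpy", "prody")):
    """Report the Python version and whether each package can be found."""
    report = {"python_ok": sys.version_info >= (3, 9)}
    for name in packages:
        # find_spec returns None when the package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'OK' if ok else 'MISSING'}")
```

Run it inside the activated `ligandmpnn_env`; any line reporting MISSING points to a dependency that `pip install -r requirements.txt` did not install.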

Testing the Installation

Run a test design on the provided example structure:

python run.py \
    --seed 111 \
    --pdb_path "./inputs/1BC8.pdb" \
    --out_folder "./outputs/test_output"

Success indicators:

Command completes without errors
Output folder contains:
- seqs/1BC8.fa - Designed sequences in FASTA format
- backbones/1BC8_1.pdb - Backbone PDB for each designed sequence
- packed/ - Structures with packed side chains, written only when packing is enabled with --pack_side_chains 1

Expected runtime: <1 minute on GPU, ~5 minutes on CPU.

HPC Job Script

#!/bin/bash
#SBATCH --job-name=ligandmpnn
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.1

# source ~/.bashrc  # Optional: Source shell profile if needed
mamba activate ligandmpnn_env

cd /path/to/LigandMPNN

python run.py \
    --model_type "ligand_mpnn" \
    --seed 111 \
    --pdb_path "./inputs/my_protein.pdb" \
    --out_folder "./outputs/my_design" \
    --number_of_batches 10

Usage Examples

Basic protein design (no ligand):

python run.py \
    --pdb_path "protein.pdb" \
    --out_folder "output/"

Design with ligand context:

python run.py \
    --model_type "ligand_mpnn" \
    --pdb_path "protein_ligand.pdb" \
    --out_folder "output/"

Fix specific residues (keep them unchanged):

python run.py \
    --pdb_path "protein.pdb" \
    --fixed_residues "A10 A20 A30" \
    --out_folder "output/"

Design only specific positions:

python run.py \
    --pdb_path "protein.pdb" \
    --redesigned_residues "A50 A51 A52 A53" \
    --out_folder "output/"
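
Typing long residue lists by hand is error-prone. As a small sketch, the space-separated strings used by `--fixed_residues` and `--redesigned_residues` can be generated from chain IDs and residue ranges. `residue_range` is a hypothetical helper, and it assumes plain integer residue numbers without insertion codes.

```python
def residue_range(chain, start, end):
    """Return a space-separated residue list such as "A50 A51 A52"."""
    if end < start:
        raise ValueError("end must be >= start")
    return " ".join(f"{chain}{i}" for i in range(start, end + 1))


# Example: a stretch on chain A plus two single residues on chain B
selection = " ".join([residue_range("A", 50, 53), "B12", "B15"])
print(selection)
```

The printed string can be pasted directly as the value of either flag.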

Batch processing multiple structures:

# Create a JSON file whose keys are the input PDB paths (the values are ignored)
echo '{"input1.pdb": "", "input2.pdb": ""}' > input_list.json

python run.py \
    --pdb_path_multi "input_list.json" \
    --out_folder "batch_output/"
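
For more than a handful of structures, the JSON file can be generated rather than typed. The sketch below assumes the format described in the LigandMPNN README, where the keys of the JSON object are the PDB paths and the values are not used. `write_pdb_list` is a hypothetical helper name.

```python
import json
from pathlib import Path


def write_pdb_list(pdb_dir, json_path):
    """Write a JSON object whose keys are the PDB paths found in pdb_dir."""
    paths = sorted(str(p) for p in Path(pdb_dir).glob("*.pdb"))
    # Values are left empty; only the keys are assumed to be read
    with open(json_path, "w") as fh:
        json.dump({p: "" for p in paths}, fh, indent=2)
    return paths
```

For example, `write_pdb_list("./inputs", "input_list.json")` collects every `.pdb` file in `./inputs`; the resulting file is then passed to `--pdb_path_multi`.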

With temperature control (higher = more diverse):

python run.py \
    --pdb_path "protein.pdb" \
    --temperature 0.2 \
    --out_folder "output/"
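
To see why a higher temperature gives more diverse sequences, consider the generic temperature-scaled softmax used by sampling models. This is an illustrative sketch of the general technique, not LigandMPNN's own code, and the logits are made-up numbers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three amino acids
for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability mass sits on the top-scoring amino acid, so repeated samples look alike; as the temperature rises, the distribution flattens and alternatives are sampled more often.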

Key Parameters

Parameter	Description	Default
`--model_type`	Model variant: `protein_mpnn`, `ligand_mpnn`, `soluble_mpnn`, etc.	`protein_mpnn`
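`--temperature`	Sampling temperature (typically 0.1-1.0). Lower = more conservative, higher = more diverse	0.1
`--number_of_batches`	Number of batches to run; total sequences = `--number_of_batches` × `--batch_size`	1
`--batch_size`	Sequences generated per batch	1
`--fixed_residues`	Space-separated residues to keep unchanged, e.g. `"A10 A20"`	None
`--redesigned_residues`	Space-separated residues to design; all other residues stay fixed	All
`--bias_AA`	Per-amino-acid bias, e.g. `"A:-1.0,P:2.0"` (positive favors, negative disfavors)	None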

Model Types

Model	Use Case
`protein_mpnn`	Standard protein sequence design
`ligand_mpnn`	Design with small molecule context
`soluble_mpnn`	Design of soluble proteins (model trained on soluble proteins only)
`global_label_membrane_mpnn`	Membrane protein design
`per_residue_label_membrane_mpnn`	Fine-grained membrane design

Understanding the Output

Output directory structure:

output/
├── seqs/
│   └── protein.fa          # Designed sequences
├── backbones/
│   ├── protein_1.pdb       # Backbone for design 1
│   └── protein_2.pdb       # Backbone for design 2
└── packed/                 # Side-chain-packed designs (requires --pack_side_chains 1)

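FASTA output format (the first record in the file is the input sequence with the run settings; each designed sequence follows with a header like this, and exact fields may vary between repository versions):

>protein, id=1, T=0.1, seed=111, overall_confidence=0.385, ligand_confidence=0.385, seq_rec=0.495
MVKLTAEGSE...

overall_confidence: Average model confidence over the redesigned residues, between 0 and 1 (higher = better fit to backbone)
ligand_confidence: The same measure restricted to redesigned residues near the ligand
seq_rec: Fraction of redesigned positions that match the input sequence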
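
To rank or filter designs, the header fields need to be pulled out of the FASTA file. The sketch below assumes only that each header is a name followed by comma-separated `key=value` pairs; `parse_fasta` is a hypothetical helper, and the field names in the sample header are placeholders.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of records with header fields split out."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First comma-separated item is the name, the rest are key=value pairs
            name, *fields = [part.strip() for part in line[1:].split(",")]
            meta = dict(f.split("=", 1) for f in fields if "=" in f)
            records.append({"name": name, "meta": meta, "sequence": ""})
        elif records:
            records[-1]["sequence"] += line
    return records


example = ">protein, id=1, T=0.1, seed=111\nMVKLTAEGSE\n"
for rec in parse_fasta(example):
    print(rec["name"], rec["meta"], rec["sequence"])
```

All values are returned as strings, so convert numeric fields with `float()` before sorting on them.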

Troubleshooting

Common Issues

“RuntimeError: CUDA out of memory”:

Reduce --batch_size
Run on CPU instead: submit the job without requesting a GPU and skip the CUDA module
LigandMPNN is memory-efficient, so this error is uncommon except for very large structures or batch sizes

PDB parsing errors:

Ensure PDB has proper formatting
Remove alternate conformations: keep only “A” conformers
Check that ligand has proper atom naming
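
Alternate conformations can be removed without extra tools, because the PDB format stores the alternate location label in column 17 of each ATOM, HETATM, and ANISOU record. The following is a minimal sketch under that assumption; `strip_altlocs` is a hypothetical helper, and atoms that exist only under another label (for example only "B") are dropped.

```python
def strip_altlocs(pdb_lines, keep="A"):
    """Keep atoms with no alternate location or with the chosen altLoc label."""
    cleaned = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM", "ANISOU")) and len(line) > 16:
            altloc = line[16]  # column 17 of the PDB format holds altLoc
            if altloc not in (" ", keep):
                continue  # drop the other conformers
            line = line[:16] + " " + line[17:]  # clear the label
        cleaned.append(line)
    return cleaned
```

A typical use is to read the file with `open(path).readlines()`, pass the list through `strip_altlocs`, and write the result to a new PDB file that is then given to `--pdb_path`.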

Ligand not recognized:

Ensure ligand is in the PDB file with HETATM records
Run with --model_type "ligand_mpnn"; the default protein_mpnn model does not use ligand atoms
Check that ligand coordinates are reasonable

Low sequence diversity:

Increase --temperature (e.g., 0.2 or 0.3)
Increase --number_of_batches
Use different random seeds
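
Running several seeds is easiest when each run gets its own output folder. The sketch below only builds the command lines, using flags that appear elsewhere in this guide; `build_commands` and the `seed_<n>` folder naming are illustrative choices, not LigandMPNN conventions.

```python
import shlex


def build_commands(pdb_path, out_root, seeds, temperature=0.2):
    """Return one run.py argument list per seed, each with its own output folder."""
    commands = []
    for seed in seeds:
        commands.append([
            "python", "run.py",
            "--seed", str(seed),
            "--pdb_path", pdb_path,
            "--temperature", str(temperature),
            "--out_folder", f"{out_root}/seed_{seed}",
        ])
    return commands


for cmd in build_commands("protein.pdb", "output", seeds=[1, 2, 3]):
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them or paste them into a job script before anything is executed.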

Side chain clashes in output:

This is expected; downstream relaxation is recommended
Relax the structure with PyRosetta or Rosetta FastRelax
Or validate with your structure prediction tool of choice