3. LigandMPNN

LigandMPNN (paper, code) is a deep learning model for context-aware protein sequence design. It extends ProteinMPNN to handle small molecules, metal ions, and other non-protein components in protein design tasks.

Live Workshop Session

🎥 Live workshop recording: Inverse folding using ProteinMPNN

Why Use LigandMPNN?

  • Ligand-aware design: Design sequences that account for bound cofactors, substrates, or drug molecules
  • Context preservation: Maintain interactions with metals, DNA, RNA, or other molecules
  • Side chain packing: Evaluate and optimize side chain conformations
  • Flexible residue control: Fix, bias, or vary specific positions

Related Tools: Use with RFdiffusion2 for backbone design, or BindCraft for complete binder design pipelines.

Resource Requirements

Resource     Minimum   Recommended   Notes
GPU RAM      4 GB      16 GB         Scales with protein size
CPU RAM      8 GB      16 GB         CPU-only is viable but slower
Disk Space   2 GB      5 GB          Model weights
Python       3.9+      3.11          Required

Preparation

Prerequisites:

  • Completed HPC Setup guide
  • Conda/Mamba installed
  • Git installed

Verify your environment:

python --version    # Should be 3.9+
nvcc --version      # For GPU support (optional)
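
The same checks can be scripted so they run at the top of a job. The following is a minimal sketch using only the Python standard library; `check_environment` is a hypothetical helper name, not part of LigandMPNN, and a missing `nvcc` is treated as non-fatal because GPU support is optional.

```python
import shutil
import sys


def check_environment(min_version=(3, 9)):
    """Summarise whether the basic prerequisites listed above are met."""
    return {
        # Compare only (major, minor) against the required minimum.
        "python_ok": sys.version_info[:2] >= min_version,
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        # nvcc is only needed for GPU runs, so absence is reported, not raised.
        "nvcc_found": shutil.which("nvcc") is not None,
        "git_found": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `python_ok` is false, activate the intended conda environment before continuing.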

Installation

  1. Clone the LigandMPNN repository:
git clone https://github.com/dauparas/LigandMPNN.git
cd LigandMPNN
  2. Download the model parameters. Note: This step requires internet access. If your compute node doesn't have internet, run this on a login node.
bash get_model_params.sh "./model_params"

Expected download: ~500 MB of model weights.

  3. Create a new conda environment:
mamba create -n ligandmpnn_env python=3.11
mamba activate ligandmpnn_env
  4. Install dependencies:
pip install -r requirements.txt

This installs PyTorch, NumPy, and ProDy for PDB file handling.

Testing the Installation

Run a test design on the provided example structure:

python run.py \
    --seed 111 \
    --pdb_path "./inputs/1BC8.pdb" \
    --out_folder "./outputs/test_output"

Success indicators:

  • Command completes without errors
  • Output folder contains:
    • seqs/1BC8.fa - Designed sequences in FASTA format
    • backbones/1BC8_1.pdb - Backbone coordinates carrying the designed sequence (one file per design)
    • packed/ - Structures with packed side chains; populated only when the run includes --pack_side_chains 1

Expected runtime: <1 minute on GPU, ~5 minutes on CPU.

HPC Job Script

#!/bin/bash
#SBATCH --job-name=ligandmpnn
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.1

# source ~/.bashrc  # Optional: Source shell profile if needed
mamba activate ligandmpnn_env

cd /path/to/LigandMPNN

python run.py \
    --model_type "ligand_mpnn" \
    --seed 111 \
    --pdb_path "./inputs/my_protein.pdb" \
    --out_folder "./outputs/my_design" \
    --number_of_batches 10

Usage Examples

Basic protein design (no ligand):

python run.py \
    --pdb_path "protein.pdb" \
    --out_folder "output/"

Design with ligand context:

python run.py \
    --model_type "ligand_mpnn" \
    --pdb_path "protein_ligand.pdb" \
    --out_folder "output/"

Fix specific residues (keep them unchanged):

python run.py \
    --pdb_path "protein.pdb" \
    --fixed_residues "A10 A20 A30" \
    --out_folder "output/"

Design only specific positions:

python run.py \
    --pdb_path "protein.pdb" \
    --redesigned_residues "A50 A51 A52 A53" \
    --out_folder "output/"
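
Typing long residue lists by hand is error-prone. The sketch below builds the space-separated chain-plus-number strings used in the two examples above; `residue_spec` and `residue_range` are hypothetical helper names introduced here for illustration, not part of LigandMPNN.

```python
def residue_spec(chain, positions):
    """Format residue numbers on one chain, e.g. 'A10 A20 A30'."""
    return " ".join(f"{chain}{pos}" for pos in positions)


def residue_range(chain, start, end):
    """Inclusive range: residue_range('A', 50, 53) covers A50 through A53."""
    if end < start:
        raise ValueError("end must be greater than or equal to start")
    return residue_spec(chain, range(start, end + 1))


if __name__ == "__main__":
    # Strings suitable for --fixed_residues or --redesigned_residues.
    print(residue_spec("A", [10, 20, 30]))
    print(residue_range("A", 50, 53))
```

Specs for several chains can be joined with a single space, for example `residue_range("A", 50, 53) + " " + residue_spec("B", [7])`.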

Batch processing multiple structures:

# Create a JSON file whose keys are the input PDB paths (values are ignored)
echo '{"input1.pdb": "", "input2.pdb": ""}' > input_list.json

python run.py \
    --pdb_path_multi "input_list.json" \
    --out_folder "batch_output/"
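
For more than a handful of structures, the JSON list can be generated from a directory. This is a minimal sketch that assumes `--pdb_path_multi` reads the PDB paths from the JSON keys and ignores the values; confirm this against the README in your checkout. `write_pdb_list` is a hypothetical helper name.

```python
import json
from pathlib import Path


def write_pdb_list(pdb_dir, json_path):
    """Write a JSON object mapping every *.pdb path in pdb_dir to ''."""
    pdb_paths = sorted(str(path) for path in Path(pdb_dir).glob("*.pdb"))
    if not pdb_paths:
        raise FileNotFoundError(f"no .pdb files found in {pdb_dir}")
    with open(json_path, "w") as handle:
        # Assumption: only the keys (paths) are read by run.py.
        json.dump({path: "" for path in pdb_paths}, handle, indent=2)
    return pdb_paths
```

Calling `write_pdb_list("inputs", "input_list.json")` produces a file that can be passed to `--pdb_path_multi`.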

With temperature control (higher = more diverse):

python run.py \
    --pdb_path "protein.pdb" \
    --temperature 0.2 \
    --out_folder "output/"
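
To see why a higher temperature gives more diverse sequences, it helps to look at temperature-scaled softmax sampling in isolation. The sketch below is a generic illustration, not LigandMPNN's own code, and the logits are made-up values for three candidate amino acids at one position.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # hypothetical scores, for illustration only
    for temperature in (0.1, 0.3, 1.0):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass sits on the top-scoring residue, so repeated samples look alike; raising the temperature flattens the distribution and lets lower-scoring residues appear.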

Key Parameters

Parameter               Description                                                    Default
--model_type            Model variant: protein_mpnn, ligand_mpnn, soluble_mpnn, etc.   protein_mpnn
--temperature           Sampling temperature (0.1-1.0). Lower = more conservative      0.1
--number_of_batches     Batches to run; total sequences = batches × batch_size         1
--batch_size            Sequences generated per batch                                  1
--fixed_residues        Space-separated residues to keep unchanged                     None
--redesigned_residues   Only design these residues; all others stay fixed              All
--bias_AA               Bias toward or against amino acids, e.g. "W:3.0,A:-3.0"        None
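
When launching many runs from a script, it can be convenient to assemble the command from these parameters. The sketch below uses only flags listed in the table; `build_run_command` and `total_sequences` are hypothetical helpers, and the sequence count assumes each batch yields `--batch_size` sequences.

```python
import shlex


def build_run_command(pdb_path, out_folder, model_type="protein_mpnn",
                      temperature=0.1, number_of_batches=1, batch_size=1,
                      seed=None):
    """Assemble an argument list for run.py from the flags in the table."""
    if number_of_batches < 1 or batch_size < 1:
        raise ValueError("number_of_batches and batch_size must be >= 1")
    cmd = [
        "python", "run.py",
        "--model_type", model_type,
        "--pdb_path", str(pdb_path),
        "--out_folder", str(out_folder),
        "--temperature", str(temperature),
        "--number_of_batches", str(number_of_batches),
        "--batch_size", str(batch_size),
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def total_sequences(number_of_batches, batch_size):
    """Assumed total: one batch_size-sized group of sequences per batch."""
    return number_of_batches * batch_size


if __name__ == "__main__":
    cmd = build_run_command("protein.pdb", "output/", model_type="ligand_mpnn",
                            number_of_batches=10, batch_size=4, seed=111)
    print(shlex.join(cmd))  # shell-quoted string, ready to paste into a job script
    print("sequences requested:", total_sequences(10, 4))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from inside the LigandMPNN directory.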

Model Types

Model                             Use Case
protein_mpnn                      Standard protein sequence design
ligand_mpnn                       Design with small molecule context
soluble_mpnn                      Bias toward soluble sequences
global_label_membrane_mpnn        Membrane protein design
per_residue_label_membrane_mpnn   Fine-grained membrane design

Understanding the Output

Output directory structure:

output/
├── seqs/
│   └── protein.fa          # Designed sequences
├── backbones/
│   ├── protein_1.pdb       # Backbone for design 1
│   └── protein_2.pdb       # Backbone for design 2
└── packed/                 # Populated only with --pack_side_chains 1

FASTA output format:

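>protein, id=1, T=0.1, seed=111, overall_confidence=0.385, ligand_confidence=0.385, seq_rec=0.495
MVKLTAEGSE...
  • overall_confidence: Model confidence averaged over the redesigned residues (higher = better fit to backbone)
  • ligand_confidence: The same measure restricted to residues near the ligand
  • seq_rec: Fraction of redesigned positions matching the input sequence
  • The first record in the file holds the input sequence together with the run settings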
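
For downstream filtering, the FASTA headers can be parsed into dictionaries. This is a minimal sketch that makes no assumption about which metrics are present: it splits each header on commas and keeps every `key=value` pair as a string. `parse_header` and `parse_fasta` are hypothetical helper names.

```python
def parse_header(header):
    """Split '>name, key=value, ...' into (name, {key: value})."""
    parts = [part.strip() for part in header.lstrip(">").split(",")]
    name, fields = parts[0], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            fields[key.strip()] = value.strip()
    return name, fields


def parse_fasta(text):
    """Return one dict per record with 'name', 'fields', and 'sequence'."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name, fields = parse_header(line)
            records.append({"name": name, "fields": fields, "sequence": ""})
        elif records:
            # Sequence lines may wrap; concatenate them onto the open record.
            records[-1]["sequence"] += line
    return records
```

Values are kept as strings; convert the metric you care about with `float(record["fields"][key])` before sorting or thresholding.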

Troubleshooting

Warning: Common Issues

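"RuntimeError: CUDA out of memory":

  • Reduce --batch_size
  • Fall back to CPU: submit the job without a GPU allocation and without loading the CUDA module; expect longer runtimes
  • LigandMPNN is memory-efficient, so this error usually points to a very large structure or a large batch size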

PDB parsing errors:

  • Ensure PDB has proper formatting
  • Remove alternate conformations: keep only "A" conformers
  • Check that ligand has proper atom naming
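
Alternate conformations can be stripped with a few lines of Python. The sketch below relies on the fixed-column PDB layout, where the alternate location indicator is column 17 of ATOM, HETATM, and ANISOU records; `keep_primary_conformers` is a hypothetical helper name, and mmCIF files are not handled.

```python
def keep_primary_conformers(pdb_text, keep="A"):
    """Drop atoms whose altLoc is neither blank nor `keep`; blank the flag."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM", "ANISOU")) and len(line) > 16:
            altloc = line[16]  # column 17 in 1-based PDB numbering
            if altloc not in (" ", keep):
                continue
            # Clear the indicator so parsers see a single conformation.
            line = line[:16] + " " + line[17:]
        kept.append(line)
    return "\n".join(kept) + "\n"
```

A typical use is `Path("clean.pdb").write_text(keep_primary_conformers(Path("raw.pdb").read_text()))`, with `Path` imported from `pathlib`.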

Ligand not recognized:

  • Ensure ligand is in the PDB file with HETATM records
  • Pass --model_type "ligand_mpnn" (the default protein_mpnn model ignores ligand atoms) and leave --ligand_mpnn_use_atom_context at 1
  • Check that ligand coordinates are reasonable
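
A quick way to confirm that the ligand is present as HETATM records is to count the hetero residue names in the file. This sketch again relies on the fixed-column PDB layout (residue name in columns 18-20); `hetatm_residue_counts` is a hypothetical helper name, and waters are skipped by default.

```python
from collections import Counter


def hetatm_residue_counts(pdb_text, skip=("HOH",)):
    """Count HETATM atoms per residue name, ignoring names listed in skip."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM") and len(line) >= 20:
            resname = line[17:20].strip()  # columns 18-20, 1-based
            if resname and resname not in skip:
                counts[resname] += 1
    return counts
```

An empty result means the file contains no non-water HETATM records, so there is no ligand context for the model to use.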

Low sequence diversity:

  • Increase --temperature (e.g., 0.2 or 0.3)
  • Increase --number_of_batches
  • Use different random seeds
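
Diversity is easier to tune when it is measured. The sketch below computes the mean pairwise sequence identity over a set of designs; values near 1.0 indicate nearly identical sequences. It assumes all designs have the same length, as is the case for sequences designed on one fixed backbone, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def mean_pairwise_identity(sequences):
    """Average identity over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

Comparing this number across runs at different `--temperature` settings shows whether a change actually increased diversity.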

Side chain clashes in output:

  • This is expected - downstream relaxation is recommended
  • Use PyRosetta or Rosetta FastRelax
  • Or validate with your structure prediction tool of choice