Plan: mmdb → gemmi conversion
Plan: mmdb → gemmi conversion
A roadmap for migrating Coot’s molecular structure backend from mmdb2 to gemmi.
Companion to gemmi-conversion-review.md (which documents the 26 functions
already converted in coot-coord-utils-gemmi.{hh,cc}).
Scope by the numbers
| Pattern | Call sites | Difficulty |
|---|---|---|
Selection API (NewSelection / SelectAtoms / GetSelIndex / SeekContacts) |
~1,613 | high |
PutUDData / GetUDData |
~632 | high (no native gemmi equivalent) |
atom->residue back-pointer |
~172 | medium |
// PDBv3 FIXME (padded atom names) |
~90 | medium |
coot-coord-utils.hh functions still to port |
~75 | low–medium |
Counts are grep-derived from the main coot tree (excluding ccp4mg-utils/).
Three cross-cutting design problems
These need a chosen approach before bulk conversion, otherwise each file gets re-debated.
Problem A: atom->residue (no gemmi equivalent)
mmdb::Atom carries a back-pointer to its Residue (and via that, Chain).
gemmi::Atom does not. Four approaches, in preference order:
-
Restructure loops to carry residue context — change
for (atom : flat_atom_list)tofor (chain) for (res) for (atom). The residue is then in the enclosing scope. Most gemmi-native, removes the back-pointer requirement entirely. Works for the majority of the 172 sites. -
gemmi::CRAtriples — when you must carry an atom out of its scope (return value, container), carry{Chain*, Residue*, Atom*}instead ofAtom*. Defineusing coot::cra_t = gemmi::CRA;for grep-ability. -
NeighborSearch::Mark— when atoms come from a spatial query, theMarkalready storeschain_idx/residue_idx/atom_idxand hasto_cra(model). Use this directly; don’t synthesize a back-pointer. -
gemmi::Model::find_cra(atom*)linear-scan helper — last-resort retrofit. Only acceptable for cold paths.
Decision rule: if you find yourself reaching for #4, the surrounding code probably needs restructuring (#1) instead.
Inventory of Atom*-returning functions whose callers use ->residue
The 259 C++ uses of atom->residue (and atom->residue->chain) trace back
to four kinds of source. Categorising them is essential because each needs a
different conversion strategy.
Category A — single-atom lookup functions returning mmdb::Atom *
These take a spec/CID and return a single pointer. Callers routinely use
->residue on the result. These are the ones whose signature must change
(return gemmi::CRA or std::optional<gemmi::CRA> instead of a bare
Atom*):
| Function | Header |
|---|---|
coot::util::get_atom(atom_spec_t, mmdb::Manager*) |
coot-utils/coot-coord-utils.hh:773 |
coot::util::get_atom_using_fuzzy_search(atom_spec_t, mmdb::Manager*) |
coot-utils/coot-coord-utils.hh:777 |
coot::util::get_atom(res_1, res_2, atom_name_quad, idx) |
coot-utils/coot-coord-utils.hh:580 |
chiral_4th_atom(residue_p, at_centre, ...) |
coot-utils/coot-coord-utils.hh:596 |
intelligent_this_residue_mmdb_atom(res_p) |
coot-utils/coot-coord-utils.hh:1247 |
get_first_atom_with_atom_name(name, ...) |
coords/mmdb-extras.hh:127 |
previous_non_riding_atom(atom_vector, ...) |
coot-utils/coot-shelx.hh:280 |
atom_spec_t::get_atom(mmdb::Manager*) |
geometry/residue-and-atom-specs.hh:199 |
coot::molecule_t::cid_to_atom(cid) |
api/coot-molecule.hh:621 |
coot::molecule_t::get_atom(atom_spec) |
api/coot-molecule.hh:655 |
molecules_container_t::get_atom_using_cid(imol, cid) |
api/molecules-container.hh:1366 |
molecules_container_t::get_atom(imol, atom_spec) |
api/molecules-container.hh:1385 |
graphics_info_t::get_atom(imol, spec) |
src/graphics-info.h:1347 |
graphics_info_t::find_atom_in_moving_atoms(spec) |
src/graphics-info.h:1494 |
graphics_info_t::get_moving_atom(pick_info) |
src/graphics-info.h:1499 |
Note: coot::util::get_atom(spec, mmdb::Residue*) at
coot-coord-utils.hh:781 is not in this list — the residue is already
the input, so the back-pointer is trivially known. Its gemmi equivalent can
keep returning gemmi::Atom*.
Category B — mmdb::Atom** selection arrays
These don’t return a single Atom* but a flat array. Callers iterate with
selection[i]->residue. Top offenders:
coords/Bond_lines.cc(~23 sites)coot-utils/merge-atom-selections.cc(~12 sites)ideal/torsion-bonds.cc(~6 sites)coot-utils/coot-coord-utils.cc(~10 sites, mostly internal selection use)
The mmdb sources are:
mmdb::Manager::GetSelIndex(handle, atom_selection, n_atoms)mmdb::Manager::SeekContacts(...)→mmdb::Contact*whoseid1/id2index back into anatom_selection
These don’t translate function-by-function. They restructure into nested
for (chain) for (res) for (atom) loops or gemmi::NeighborSearch —
whose Mark::to_cra(model) gives back chain + residue + atom explicitly.
Category C — refinement’s internal atom[i] array
restraints_container_t::atom is an mmdb::Atom** member of the refinement
container. The entry point is
restraints_container_t::get_atom(int i) const(ideal/simple-restraint.hh:2396)
used throughout ideal/ as atom[i]->residue,
atom[restraint.atom_index_1]->residue, etc. ~80+ sites across
simple-restraint.cc, ng.cc, mods.cc, make-restraints.cc,
trans-peptide.cc, crankshaft.cc.
Strategy: store std::vector<gemmi::CRA> (or parallel
std::vector<int> residue_index, chain_index) instead of mmdb::Atom**.
get_atom(i) returns the Atom* slot of the CRA; add a new
get_residue(i) / get_cra(i) for the rest. Combined with the
atom.serial-as-stable-index decision in Problem B below, this becomes a
clean refactor.
Category D — at->residue->chain (one extra hop)
A handful of sites need the chain too, e.g.
coot-utils/coot-coord-utils.cc:5268—at->residue->chain == moving_chainanalysis/daca.cc:1082—at->residue->chain == reference_residue_p->chain
CRA covers these natively — cra.chain == ... — without a second lookup.
Heat map
Per-file atom->residue counts (top 10), useful for prioritising:
| File | Count |
|---|---|
coords/Bond_lines.cc |
23 |
coot-utils/coot-coord-utils-glyco.cc |
21 |
src/graphics-info-defines.cc |
13 |
ideal/simple-restraint.cc |
12 |
coot-utils/merge-atom-selections.cc |
12 |
ideal/ng.cc |
10 |
coot-utils/coot-coord-utils.cc |
10 |
src/molecule-class-info-other.cc |
9 |
coot-utils/stack-and-pair.cc |
9 |
ideal/trans-peptide.cc |
8 |
Problem B: PutUDData / GetUDData replacement (~632 sites)
gemmi::Atom has these spare-ish slots (from model.hpp lines 129–144):
| Slot | Type | Default use | Hijackable? |
|---|---|---|---|
serial |
int |
PDB serial number | Yes — already standard idiom for “atom index handle” |
flag |
char |
“custom flag” | Yes — boolean / small-enum tags |
tls_group_id |
short |
TLS group | Yes if no TLS in play |
fraction |
float |
Refmac D fraction | Yes if no D refinement |
UDD usage in coot falls into three categories:
| Category | Example | Replacement |
|---|---|---|
| Atom-index handle (refinement) | atom[i]->PutUDData(udd_atom_index_handle, i) then at->GetUDData(...,ai) to map back into a flat coord array |
Use atom.serial = i. Stable for the duration of the refinement. The site at ideal/simple-restraint.cc:4367 is the canonical example. |
| Boolean / enum flags | at->PutUDData(uddHnd, 0/1) for “fixed” / “moving” / “in residue X” |
Either atom.flag (one char) or external std::vector<bool> indexed by atom.serial |
| Multi-handle data | Separate handles for bond / angle / torsion membership | External std::unordered_map<int, T> keyed by atom.serial, or parallel std::vector<T> sized to atom count |
Decision rule: make atom.serial the canonical stable index; everything
else is an external container keyed on it. Never use raw gemmi::Atom* as a
map key — vector reallocation invalidates the pointer.
Risk: atom.serial is also written by PDB parsing, so any code that reads
“PDB serial” must do so before refinement repurposes the field. Document this
clearly. Alternatively: introduce coot::set_atom_indices(structure) and
treat serial as ours from that point. Note that serial also matters for
CONECT records, ligand parsing and PDB round-tripping — repurposing it is not
free.
Float-typed UDD (RegisterUDReal) — 6 sites
In addition to the integer handles, coot uses RegisterUDReal to attach a
float per atom in 6 places:
| Site | Purpose |
|---|---|
coot-utils/coot-shelx-ins.cc:119 |
Riding-atom negative U value |
coot-utils/q-score.hh:264 |
Q-score per atom |
coot-utils/hole.cc:58 |
Atom radius (hole calculation) |
coords/Bond_lines.cc:5439 |
Rainbow colour position |
coords/Bond_lines.cc:5504 |
B-factor fraction point |
src/molecule-class-info.cc:10577 |
B-factor scale (registered on hierarchy, not atom) |
gemmi::Atom::fraction (float, “custom value, one use is Refmac’s
ccp4_deuterium_fraction”) is the natural single-payload slot — coot does no D
refinement, so it is free. Good for the single-float uses (Q-score, radius,
rainbow position). For more than one float per atom at once, the same
external-vector pattern applies: std::vector<float> indexed by the stable
atom index.
Design option: extending gemmi::Atom upstream
Three shapes were considered for a richer per-atom user slot. Memory cost is the deciding factor (cryo-EM structures reach ~1M atoms):
| Approach | Per-atom cost (unused) | Per-atom cost (used) | Verdict |
|---|---|---|---|
int user_defined_int |
4 B always | 4 B | Narrow (one slot) but trivial; doesn’t solve multi-handle |
void *user_data |
8 B always | 8 B + caller payload | Most flexible; C-API-standard “hang anything here”; best upstream pitch |
std::unordered_map<int,int> user_int_map per atom |
~56 B always (empty) | ~56 B + ~32 B/entry | Faithful to mmdb’s model but worst layout: ~56 MB of empty maps on a 1M-atom structure, pointer indirection + allocator pressure per access. Rejected. |
The per-atom map is the same data shape as the recommended external
container but with the worst possible memory layout — it taxes every atom
that has no UDD at all. The external std::unordered_map keyed on the stable
atom index is pay-only-for-what-you-use and needs no gemmi change.
Plan of record:
- Propose
void *user_data(oruint32_t user_tag) upstream to gemmi, pitched generically (“one stable per-atom hook for downstream tools, equivalent to a single mmdbUDR_ATOMhandle”) — not as a coot-specific feature. Best chance of acceptance; serves any tool. - Design the migration as if it lands. If accepted, the heaviest single
handle (
udd_atom_index_handle) and the single-float cases hang off it cleanly without touchingserial. - Fall back to
atom.serial(int) /atom.fraction(float) only if the upstream proposal stalls. - Either way, the multi-handle pattern (refinement registering several int
handles on the same atom simultaneously) still uses an external
std::vector<T>/std::unordered_map<int,T>keyed by the stable atom index. No gemmi change displaces that.
Problem C: Selection API (~1,613 sites)
mmdb’s NewSelection / SelectAtoms / GetSelIndex returns mmdb::Atom**
flat arrays. Replacements by pattern:
| mmdb pattern | gemmi replacement |
|---|---|
| Whole-molecule atom scan | nested for (chain) for (res) for (atom) |
| Single chain | for (auto &res : chain.residues) |
| Spatial / contacts | gemmi::NeighborSearch::find_neighbors (returns Mark*) |
| Atom-spec lookup | chain.find_residue_group(SeqId) then res.find_atom(name, altloc) |
| Text-spec (“/A/12-30/CA”) | gemmi::Selection from select.hpp |
| “All atoms in residue list” | direct iteration over res.atoms |
SeekContacts |
NeighborSearch::populate() once, then find_neighbors per query atom |
Decision rule: convert by pattern in batches, not file-by-file. After Problems A and B are solved, most selection sites become trivial.
Other problem corners
- SSM (structure superposition) — no gemmi equivalent. Either keep the mmdb sub-dependency just for SSM, or replace with clipper or a hand-rolled Kabsch. Park this until last.
-
TER atoms — gemmi has no TER atoms in memory. Chain ends are implicit (end of
chain.residues, or the polymer / non-polymer boundary determined byentity_type). On PDB output, the writer synthesises TER records from that structural information. From~/gemmi/include/gemmi/to_pdb.hpp:bool ter_records = true; // Write TER records (default on) bool numbered_ter = true; // TER gets its own serial number bool ter_ignores_type = false; // Put TER after last atom in chain // (even water) — reverts to the older // "TER after the last atom regardless" // behaviourDefault placement: TER goes after the last polymer residue of a chain; non-polymer (waters, ligands) follow without a preceding TER. Set
ter_ignores_type = truefor the older behaviour.Migration implication: mmdb code routinely does
if (!at->isTer())checks to skip dummy TER atoms in iteration — these are unnecessary and meaningless when iterating gemmi residues. They should just be removed during conversion, not translated. Also: anything that depends onentity_typebeing correctly populated for sensible TER placement should verify thatcopy_from_mmdb(or the upstream PDB read) sets it. - PDBv3 atom names — already understood, ~90 FIXME markers. Resolved per-file as conversion proceeds. All gemmi code uses unpadded names natively.
copy_from_mmdbgaps — secondary structure, headers, resolution, biomolecule info not currently copied. Extend or accept loss for now.label_atom_idpadding bug —trim_atom_names()workaround in place; upstream fix to gemmi still TODO (propose to Marcin).
Phase plan
Phase 1 — Finish the utility layer in coot-coord-utils-gemmi
~25 more functions, in this order (each with a comparison test against mmdb):
get_residue(chain_id, res_no, ins_code, st)— wrapschain.find_residue_group(SeqId)get_atom(atom_spec_t, st)— trims spec name, thenres.find_atomresidues_near_position(pt, st, radius)—NeighborSearch::find_neighborsresidues_near_residue(res, st, radius)— same, looped over reference residue’s atomsresidues_near_residues(st, dist)— full neighbour mapget_waters,get_hetgroups,atoms_with_zero_occupancydeep_copy_residue,copy_and_delete_hydrogenssort_chains,sort_residues,shift_to_origin,move_waters_to_last_chain
Phase 2 — Lock down the cross-cutting decisions
- Write a short “gemmi conversion conventions” doc covering Problems A/B/C.
- Add
coot::set_atom_indices(gemmi::Structure&)helper that walks the structure assigningatom.serial = global_index. - Add convenience
coot::cra_talias. - Pick one small file (e.g.
analysis/stats.hhor a chunk ofcoot-utils/peak-search.cc) and convert it end-to-end as a worked example demonstrating the conventions. This becomes the reference for everyone.
Phase 3 — File-by-file migration, easiest first
Order by reverse-dependency depth (leaves first):
analysis/— validation, mostly read-only- remaining
coot-utils/files density-contour/— already mostly clipperMoleculesToTriangles/— coordinate-drivenligand/— heavy selection uselidia-core/— partly RDKit alreadygeometry/— restraints, padded names everywhereideal/— refinement, heaviest UDD use; hardestapi/—molecules_container_tmethod-by-methodsrc/— GUI layer, last
For each file: keep mmdb at the public interface initially (caller passes
mmdb::Manager*, the function does copy_from_mmdb internally). Once enough
callers are converted, flip the signature to take gemmi::Structure&.
Phase 4 — Storage migration in molecule_t / molecule_class_info_t
Add gemmi::Structure structure_; alongside the existing
mmdb::Manager *atom_sel.mol.
Three strategies:
- Dual-storage with sync (start here): both representations live, gemmi rebuilt from mmdb on modification. Lets gemmi paths be validated against mmdb.
- gemmi-primary, mmdb lazy (when ≥80% of read paths converted): gemmi is authoritative; mmdb built on demand for unconverted code.
- mmdb-only legacy mode (escape hatch): for SSM and any unported code.
Phase 5 — Cut mmdb
Remove the conversion layer. Drop mmdb dependency (or keep it isolated to SSM).
What to do this week
- Write
get_residue/get_atom/residues_near_position/residues_near_residue+ tests (~1–2 days) - Write the conventions doc +
set_atom_indiceshelper (~half day) - Convert one small file end-to-end as a worked example (~1 day)