Status (2026-04-10)

Three files in coot-utils/:

  • coot-coord-utils-gemmi.hh — header
  • coot-coord-utils-gemmi.cc — implementations
  • test-coord-utils-gemmi.cc — 26 comparison tests

Added to Makefile.am: .cc in library sources, .hh in installed headers, test binary in check_PROGRAMS.

All 26 tests pass. Build playground: build-for-claude/coot-utils/


Critical discovery: PDBv3 atom names

mmdb uses 4-character padded atom names ( CA , ` N , C ). gemmi uses unpadded names (CA, N, C). The codebase has ~90 // PDBv3 FIXME` markers at sites using padded names.

The problem with copy_from_mmdb(): gemmi’s copy_from_mmdb() copies m_atom.label_atom_id into atom.name. Despite being documented as “not aligned” in mmdb_atom.h, mmdb pads label_atom_id identically to name when reading PDB files. Verified with pure mmdb (no coot code involved) — both fields are 4 chars padded.

Current workaround: coot::trim_atom_names(gemmi::Structure &st) in coot-coord-utils-gemmi.{hh,cc}. Strips leading/trailing spaces from all atom names. Must be called after every copy_from_mmdb(). The test file wraps this in a helper:

gemmi::Structure gemmi_from_mmdb(mmdb::Manager *mol) {
   gemmi::Structure st = gemmi::copy_from_mmdb(mol);
   coot::trim_atom_names(st);
   return st;
}

Proper fix (TODO): This should be fixed in gemmi’s copy_from_mmdb() itself — propose the change upstream to Marcin.

All gemmi functions use unpadded atom names ("CA", "N", "C", "CB", etc.). This is the correct style going forward. The // PDBv3 FIXME markers in the codebase (~90 sites) map the work needed when migrating each file.

The hardest areas for the PDBv3 transition will be ideal/ and geometry/ — padded atom names are deeply embedded in the restraint matching system.


mmdb symmetry setup

mmdb needs syminfo.lib to resolve spacegroup operations. Without it, GetTMatrix() silently fails and reports no symmetry. The test calls setup_syminfo() (from utils/setup-syminfo.hh) in main() before running tests. Without this, a P 21 21 21 structure was incorrectly reported as having no symmetry by the mmdb version.


What was converted (25 functions + trim_atom_names)

Utility

| Function | Notes | |———-|——-| | trim_atom_names(gemmi::Structure&) | Strips padded atom names after copy_from_mmdb() |

Atom-level (4, in coot::)

Function Notes
distance(gemmi::Atom&, gemmi::Atom&) Manual sqrt, matches mmdb overload
angle(gemmi::Atom&, gemmi::Atom&, gemmi::Atom&) Returns degrees, clamps cos_theta
co(gemmi::Atom&) Returns clipper::Coord_orth
is_hydrogen_atom(gemmi::Atom&) Delegates to at.is_hydrogen()

Residue-level (6, in coot::util::)

Function Notes
get_residue_centre Mean of all atom positions
get_CA_position_in_residue Finds atom named "CA" (unpadded)
get_CB_position_in_residue Finds atom named "CB" (unpadded)
is_nucleotide Hardcoded residue name list
residue_has_hydrogens_p Uses at.is_hydrogen()
residue_has_hetatms Checks res.het_flag == 'H'

Chain-level (5, in coot::util::)

Function Notes
min_and_max_residues Uses res.seqid.num.value
min_resno_in_chain Pair<bool,int> return
max_resno_in_chain Pair<bool,int> return
residue_types_in_chain Returns sorted unique names via set
get_number_of_protein_or_nucleotides Nucleotide via name; protein via entity_type == Polymer

Structure-level (15, mix of coot:: and coot::util::)

Function Notes
centre_of_molecule First model only
radius_of_gyration First model only
mol_has_symmetry Uses find_spacegroup(), checks operations().order() > 1
mol_is_anisotropic Checks any at.aniso.nonzero()
get_position_hash Sum of x-differences (matches mmdb algorithm, no TER atoms)
residue_types_in_molecule Set-based unique types
non_standard_residue_types_in_molecule Uses standard_residue_types() from coot-coord-utils.hh
chains_in_molecule Chain name list
number_of_residues_in_molecule Sum of chain.residues.size()
max_number_of_residues_in_chain Max chain length
number_of_chains models[0].chains.size()
max_resno_in_molecule Max seqid across all chains
max_min_max_residue_range Max (max-min+1) across chains (inclusive, matching mmdb)
alt_confs_in_molecule Includes "" for non-alt atoms (matching mmdb)
extents Min/max bounding box
median_position Sort-based median per axis
count_cis_peptides Omega torsion from geometry, C-N distance < 3.0 A check

Bugs found and fixed during testing

Issue Root cause Fix
get_CA_position_in_residue failed Atom names padded by mmdb, gemmi searched for "CA" not " CA " trim_atom_names()
get_position_hash mismatch (57 vs 178516) Gemmi version summed absolute positions, mmdb sums x-differences Rewrote to match mmdb algorithm
mol_has_symmetry mismatch (mmdb=0, gemmi=1) mmdb lacked syminfo.lib; gemmi correctly found P 21 21 21 Added setup_syminfo(), changed gemmi impl to use find_spacegroup()
max_min_max_residue_range off by 1 mmdb uses inclusive range (max-min+1), gemmi used (max-min) Added +1
alt_confs_in_molecule mismatch mmdb includes empty string for non-alt atoms, gemmi skipped them Include "" for altloc == '\0'
count_cis_peptides mismatch Gemmi used \|omega\| < 30 deg, mmdb uses \|180-pos_torsion\| > 90 + distance check Matched mmdb’s distortion/distance logic
Segfault on all tests replace_all on copy_from_mmdb caught the helper’s own body → infinite recursion Fixed gemmi_from_mmdb() to call gemmi::copy_from_mmdb()
Missing includes <fstream>, <set>, "coot-coord-utils.hh", "utils/setup-syminfo.hh" Added

Remaining issues to watch

  1. get_number_of_protein_or_nucleotides: Uses res.entity_type == gemmi::EntityType::Polymer — may not be populated by copy_from_mmdb(). No test for this yet.

  2. residue_has_hetatms: Checks res.het_flag == 'H'. copy_from_mmdb() does set het_flag (verified in gemmi source line 345), so this should be OK.

  3. is_nucleotide: Hardcoded name list. The mmdb version may use a different heuristic. Should verify they agree on a nucleotide-containing structure.

  4. get_position_hash: mmdb and gemmi hashes differ numerically (TER atoms affect mmdb’s x-difference chain). Test verifies gemmi self-consistency rather than cross-comparing. This is fine — the hash just needs to detect when atoms move.

  5. First-model-only pattern: Structure-level functions use for (...) { ... break; }. Works but could use st.models[0] directly with an empty check.


What was NOT converted (~75+ functions remaining in coot-coord-utils.hh)

Residue operations (high value for migration)

  • get_residue(chain_id, resno, ins_code, mol) — residue lookup
  • get_atom(atom_spec, mol) — atom lookup by spec
  • deep_copy_residue — residue duplication
  • get_residue_name(chain_id, resno, ins_code, mol) — name lookup
  • residues_near_residue — spatial queries
  • residues_near_position — spatial queries
  • get_waters — water identification

Atom selection / searching

  • atoms_with_zero_occ — find zero-occupancy atoms
  • fill_residue_alt_confs — enumerate alt confs per residue
  • contact_info — atom contacts

Whole-molecule operations

  • sort_chains, sort_residues — reordering
  • shift_to_origin — coordinate transformation
  • copy_and_delete_hydrogens — hydrogen stripping
  • move_waters_to_last_chain

Selection API (~1,450 calls across codebase)

  • NewSelection / SelectAtoms / GetSelIndex / DeleteSelection — hardest migration target
  • gemmi equivalent: range-based for loops with conditionals, or gemmi::Selection
  • Functions in coot-map-utils.hh — separate from coord-utils migration

Approach notes

  • gemmi::Structure created on the fly via gemmi::copy_from_mmdb(mol) + trim_atom_names() — no persistent dual representation
  • Tests compare mmdb and gemmi results on the same molecule, tolerance-based for floats
  • copy_from_mmdb() will need extension for full migration (missing: secondary structure, title/header, resolution, biomolecule info)
  • The mmdb selection API is the biggest migration challenge — ~1,450 call sites
  • SSM (structure superposition) may not have a gemmi equivalent — can be side-lined
  • TER atoms: gemmi doesn’t have them. They only matter for PDB file output. Skip them in all logic.

Next steps: Prioritize converting residue-lookup functions (get_residue, get_atom) and spatial queries (residues_near_residue/position) — most commonly used across the codebase.