IO Pipelines

The crate exports load_scene, load_trajectory, and load_frequency_data, each selecting the appropriate parser based on file extension hints and content sniffing.
formats/ contains format-specific parsers:
- xyz/, pdb/, cif/, vasp/, sdf.rs handle canonical structural formats.
- gaussian/ and nwchem/ include streaming parsers that expose run summaries (GaussianRunSummary, NwchemRunSummary), stage boundaries, and trajectory extraction helpers used by the viewer’s tasks panel.
- cube/ and nbo/ parse volumetric grids and natural bond orbital data for orbital visualisation.
Streaming helpers (CountingReader, ProgressFn) support background loading with live progress indicators; the UI’s background module uses these hooks.
registry.rs implements the loader table; add new formats here when extending support.
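
As an illustration of how a byte-counting reader can drive a progress indicator, here is a minimal sketch. It reuses the names CountingReader and ProgressFn from the text above, but the fields, constructor, and callback signature are assumptions made for this example and may not match the crate's real definitions.

```rust
use std::io::{self, Read};

/// Assumed callback shape: (bytes read so far, total bytes if known).
pub type ProgressFn = Box<dyn FnMut(u64, Option<u64>) + Send>;

/// Wraps any reader and reports the cumulative byte count after each read.
pub struct CountingReader<R: Read> {
    inner: R,
    bytes_read: u64,
    total: Option<u64>,
    on_progress: ProgressFn,
}

impl<R: Read> CountingReader<R> {
    pub fn new(inner: R, total: Option<u64>, on_progress: ProgressFn) -> Self {
        Self { inner, bytes_read: 0, total, on_progress }
    }

    /// Total number of bytes handed to the caller so far.
    pub fn bytes_read(&self) -> u64 {
        self.bytes_read
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes_read += n as u64;
        // Notify the listener after every read, including the final zero-length one.
        (self.on_progress)(self.bytes_read, self.total);
        Ok(n)
    }
}
```

Because the wrapper implements Read, a parser that already accepts a generic reader would not need to know that progress is being tracked.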

6.1 Adding a New Chemistry Format

Orbitron does not require a trait implementation to register a file format; instead it follows a convention-based layout under io/pipelines/src/formats. To keep onboarding fast, apply the pattern below (Gaussian/NWChem are end-to-end references).

Bootstrap the module
- Copy the structure of a neighbouring format (e.g. xyz, cif, or gaussian) so parsing, streaming, and summary helpers live under io/pipelines/src/formats/<format>/.
- Provide a detect_<format> helper that accepts (path, contents) and returns Option<bool> so load_scene can cheaply sniff the input.
- Expose parse_<format>_scene (and optionally _trajectory/_summary_stream) returning Result<SceneGraph>/Result<Trajectory>/summary structs.
Register the loader
- Add detection hooks to io/pipelines/src/formats/mod.rs so load_scene, load_trajectory, and load_frequency_data route to the new parser.
- Insert a LoaderFactory entry in io/pipelines/src/registry.rs if the format should appear in the dynamic registry used by automation clients.
Offer streaming & summaries (optional but recommended)
- For multi-stage logs, mirror the Gaussian/NWChem approach: create a summary.rs with a RunSummary and Stage/TaskBoundary structs that record byte offsets. Re-export them from the module’s mod.rs so the UI/CLI can lazily pull individual stages.
- Implement streaming (parse_*_stream) when files can be multi-GB; follow the line-walking idioms in gaussian::mod or nwchem::streaming.
Add fixtures & tests
- Place representative fixtures under io/pipelines/tests/fixtures/<format>/.
- Add targeted unit/integration tests in io/pipelines/tests/ (see gaussian_summary.rs, nwchem_summary.rs, cube_loading.rs for patterns).
- If the parser includes summary/streaming helpers, add smoke tests that validate classification, byte boundaries, and metadata counts.
Wire higher-level surfaces
- Update UI/TUI task panels if the summary types expose new metadata.
- Extend CLI smoke tests or regression suites so CI exercises the format end-to-end.
Document & export
- Record any user-facing instructions in docs/USER_GUIDE.md (file type, known limitations).
- If Python bindings should expose the format, add corresponding methods in extensions/python-bridge and rebuild via maturin develop.

Checklist: detector hook ✅ parser skeleton ✅ streaming helper ✅ registry entry ✅ fixtures/tests ✅ docs ✅

This sequence keeps new formats aligned with the rest of the pipeline, ensures background streaming remains available, and gives downstream clients (GUI, CLI, Python) consistent abstractions.
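
The detector convention from the bootstrap step can be sketched as below. Everything here is hypothetical: the ".demo" format, the DetectFn alias, the route function, and the reading of Option<bool> as Some(true) = match, Some(false) = definitely not, None = cannot tell. The real hooks in formats/mod.rs may be wired differently.

```rust
use std::path::Path;

/// Assumed detector signature following the detect_<format> convention.
type DetectFn = fn(&Path, &str) -> Option<bool>;

/// Hypothetical detector for an invented ".demo" format whose first
/// non-empty line is the literal header "DEMO-FORMAT".
fn detect_demo(path: &Path, contents: &str) -> Option<bool> {
    match contents.lines().find(|l| !l.trim().is_empty()) {
        Some(first) => Some(first.trim() == "DEMO-FORMAT"),
        // No content to sniff: fall back to the extension hint, if any.
        None => path.extension().map(|ext| ext.eq_ignore_ascii_case("demo")),
    }
}

/// Simplified XYZ-style detector: the first line must be an atom count.
fn detect_xyz_like(_path: &Path, contents: &str) -> Option<bool> {
    let first = contents.lines().next()?;
    Some(first.trim().parse::<usize>().is_ok())
}

/// Returns the name of the first format whose detector reports Some(true).
fn route(
    path: &Path,
    contents: &str,
    table: &[(&'static str, DetectFn)],
) -> Option<&'static str> {
    table
        .iter()
        .find(|(_, detect)| detect(path, contents) == Some(true))
        .map(|(name, _)| *name)
}
```

Keeping detectors cheap (first line or extension only) matters because every registered detector may run before a parser is chosen.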

6.2 Canonical Pipeline & Attachments

Orbitron’s canonical pipeline decouples viewers and automation tooling from program-specific output formats. Every migrated parser produces a CanonicalOutcome that ultimately powers load_scene, load_trajectory, the CLI, and downstream bundles.

Key modules and types

io/pipelines/src/canonical/mod.rs defines the canonical schema (CanonicalDocument), builder ergonomics (CanonicalBuilder), attachment metadata (AttachmentRef), and the enriched CanonicalOutcome containing optional fast-path conversions plus the attachment payloads.
formats::<format>::canonical houses the format-specific adapter that translates raw parse results into a canonical document and populates attachments.
formats::load_canonical returns the full CanonicalOutcome. It is preferred over load_canonical_document when you need access to attachments or pre-computed scene/trajectory data.
automation/cli exposes orbitron canonical export, which writes the manifest to <bundle>/manifest.json and each attachment to <bundle>/attachments/<sha>.<ext>.

Building a canonical parser

Parse the format as usual (geometry snapshots, trajectories, frequencies, thermochemistry) and feed the results into CanonicalBuilder. At a minimum, populate structure; only advertise the Trajectory/Frequency/Thermochemistry capabilities when the format truly supports them.
- Use canonical::base_builder_with_raw_source* helpers to seed SourceMetadata, Provenance, and a raw_source attachment. Extend with build_mo_coefficients_attachment / build_volumetric_attachment when emitting MO or grid payloads. The volumetric helper accepts a VolumetricAttachmentParams struct describing the grid geometry/metadata and returns both the attachment bytes and the dataset entry so callers can embed the resulting record in their payload.volumetric.datasets[] array.
- For periodic/solid-state formats, prefer the shared helpers:
  - structure_section_from_parts accepts atoms + unit cell + metadata closure so you can skip the boilerplate of creating a temporary SceneBuilder. This is now used by VASP and makes it trivial to plug in future mat-sci parsers (QE, CASTEP, DIRAC, …).
  - build_periodic_electronic_structure(|periodic| { … }) wraps the PeriodicElectronicStructure struct and produces an ElectronicStructure only when periodic data exists. VASP now calls this helper so new periodic formats only implement the callback and reuse the shared schema wiring.
Emit provenance via ProvenanceInput (path + optional checksum). This keeps cache entries and downstream bundle fingerprints deterministic.
Create attachments for any material too large to remain inline. Typical attachments include:
- Raw source log/input (so users can reproduce the canonical document without the original file). Compute a SHA-256 digest, store the bytes, and commit attachment.reference.metadata entries such as original_extension, encoding, and source_path so exporters can restore the file name.
- Molecular orbital coefficients (e.g., Pop=Full outputs). Trim the dense coefficient tables from the inline payload and store them as attachments (gaussian:mo_coefficients, nwchem:mo_coefficients). Hydrate the coefficients by calling CanonicalOutcome::into_scene_graph() / .into_trajectory() before handing snapshots to callers.
- Volumetric grids / trajectory shards: stream the binary data into a Vec<u8> and populate shape/unit metadata. Keep one attachment per logical asset (e.g., HOMO cube, charge density); the standalone CUBE handler demonstrates the pattern by emitting a volumetric_grid attachment plus a manifest dataset entry keyed by the attachment id.
- Pre-computed derived artefacts (e.g., rendered images, multi-format exports) can also be attached; annotate them clearly so consumers can differentiate raw vs derived data.

When emitting program-specific metadata, use ProgramExtrasBuilder from canonical::helpers:
- ProgramExtrasBuilder::new("qe") (or "molpro", "vasp", etc.) exposes fluent insert(...) and with_tasks(&summary.tasks) methods and returns an IndexMap<String, Value> ready for CanonicalBuilder::with_extras. This keeps every format’s extras under a single key and ensures task_count / tasks are populated consistently.
After collecting attachments, call register_attachment_refs(builder, &attachments) followed by outcome_with_attachments(document, attachments) when finishing the outcome. These helpers make it much harder to forget registering raw_source entries or to leak bytes, and they centralise the logic we previously duplicated across every format.
PDB canonical structure: the PDB loader produces canonical documents with explicit CONECT connectivity, unit-cell metadata (when present), and secondary-structure hints encoded as SceneMetadata tags. Downstream consumers get these bonds automatically when calling load_canonical(...).into_scene_graph().
Format-specific expectations
- xyz / sdf / pdb / cif: Always emit a raw_source attachment and use structure_section_from_parts (or _from_snapshot_custom) so metadata wiring stays consistent. These lightweight formats now finish every builder by chaining register_attachment_refs(...); outcome_with_attachments(...), so new text-based structures should mirror that sequence to avoid forgetting attachment references.
- gaussian / nwchem / molpro / molcas / dirac: Populate extras via ProgramExtrasBuilder ("gaussian", "nwchem", etc.), externalise MO coefficients into attachments where applicable, and call build_trajectory_positions_attachment / build_frequency_displacements_attachment so the trajectory/frequency sections reference binary shards when present.
- qe: Use structure_section_from_parts and build_periodic_electronic_structure. All QE metadata should live under extras["qe"] (energies, relax profile, DOS/band/PDOS summaries) via the program extras builder. QE volumetric grids (.xsf) should be routed through the shared volumetric attachment helpers so overlays can reuse the CUBE pipeline.
- vasp: Reuse structure_section_from_parts to build the periodic structure, attach related artefacts (DOSCAR, PROCAR, EIGENVAL, charge grids), and rely on build_periodic_electronic_structure for band/DOS payloads.
- Volumetric bundles (cube single-file handlers, volumetric directories) and standalone NBO summaries: run every attachment (raw sources, grids, population tables) through register_attachment_refs + outcome_with_attachments and keep provenance metadata in sync via base_builder_with_raw_source*. Record dataset entries referencing the attachment id, shape, origin, voxel vectors, and source path.
QE canonical structure and extras: the QE handler currently supports selected SCF and relax outputs (e.g., qm_tests/qe/benzene/scf.out, benzene/relax.out, graphene_scf.out, Si/scf.out, srtio3/srtio3.out), emitting canonical structures with periodic unit cells plus:
- extras["qe"].scf_total_energy_ry / scf_total_energy_hartree / fermi_level_ev,
- a basic SCF task list (extras["qe"].tasks),
- inline band points parsed from QE’s bands (ev): blocks for SCF runs,
- lightweight DOS and band summaries for QE *.dat artefacts (dos_summary, bands_summary),
- PDOS file summaries and packed grids for *.pdos_* outputs (pdos_summary + attachments),
- volumetric grid attachments for cube-style .xsf outputs (payload.volumetric),
- a relax energy profile (relax_profile) when multiple SCF steps are present. These extras power CLI inspect / canonical export flows today and are intended to feed a future QE panel in the viewer.
Cross-program task summaries: io/pipelines::tasks exposes ProgramTaskSummary plus helpers (nwchem_tasks_to_program_summaries, qe_tasks_to_program_summaries, collect_program_task_summaries). Canonical extras obtained from load_canonical_document now drive both orbitron inspect and orbitron info so DIRAC/QE/NWChem tasks show up consistently in CLI output and JSON (program_tasks array). Molpro extras also back the new --molpro-task and --molpro-kind (alias --task) filters on both commands, letting contributors script against specific Molpro modules without re-tokenising the raw logs. Keep the extras JSON stable when adding task metadata so CLI filters remain robust.

Store attachments in the CanonicalOutcome via with_attachment/push_attachment. The document retains only the references; the data lives alongside the manifest and is retrieved through outcome.attachments().

Exporting canonical bundles

Use orbitron canonical export <source> to materialise the bundle. The command writes:
- manifest.json – the canonical document (pretty-print via --pretty).
- attachments/<sha>.<ext> – one file per attachment, where ext defaults to the metadata’s original_extension or bin.
Consumers can reconstruct the CanonicalOutcome by reading the manifest and attachment files; orbitron canonical import already verifies hashes and restores the raw_source payload (with additional extract modes planned for volumetric grids and other artefacts).
- Manage cached bundles with orbitron canonical cache path|list|purge [digest] (cache lives under ORBITRON_CANONICAL_CACHE or the platform cache dir).
When adding new attachment types, update automation/cli/tests/canonical.rs (or add a dedicated test) to assert the file name, byte content, and metadata. This keeps the bundle contract stable. Pair the CLI coverage with a canonical round-trip test (load_canonical(...).into_scene_graph()) so coefficient-style attachments rehydrate before downstream consumers inspect the scene.
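
The attachment naming rule (attachments/<sha>.<ext>, with ext falling back to bin) can be illustrated with a small helper. This is a sketch only: attachment_bundle_path is a made-up name, and modelling the metadata as a string-to-string map is an assumption about the real AttachmentRef type.

```rust
use std::collections::BTreeMap;

/// Builds the bundle-relative path for one attachment.
/// Assumption: metadata is a plain string map holding "original_extension".
fn attachment_bundle_path(sha_hex: &str, metadata: &BTreeMap<String, String>) -> String {
    let ext = metadata
        .get("original_extension")
        // Accept both "log" and ".log" spellings.
        .map(|e| e.trim_start_matches('.'))
        .filter(|e| !e.is_empty())
        // No usable extension recorded: fall back to "bin".
        .unwrap_or("bin");
    format!("attachments/{sha_hex}.{ext}")
}
```

A test for a new attachment type could assert on exactly this path before checking byte content and metadata.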

Testing checklist

Unit/integration tests should call load_canonical to ensure attachments are present and payloads are hash-stable between runs.
Extend format-specific tests (e.g., gaussian_log.rs, nwchem_out.rs, qe.rs) to assert that document.source.format and attachment/extra metadata match expectations.
Keep the smoke tests (qm_smoke.rs) tolerant of canonical error wording (structure snapshot missing, no vibrational modes) so format migrations do not create brittle failures.

Import pipeline (orbitron canonical import)

orbitron canonical import <bundle> verifies attachment hashes and, when --output is provided, restores the raw_source payload to a user-specified location. Use --id to extract a specific attachment when multiple payloads exist. Additional extraction modes (e.g., writing volumetric grids) will iterate on this base command.
Upcoming work (tracked in CANONICAL_IO_PLAN.md) will hydrate a local cache and teach OrbitronServices to reuse imported bundles before re-parsing raw logs, reusing the same verification flow implemented here.

6.3 Task Metadata & Thermochemistry Pipelines

Gaussian summaries record optimisation energy trajectories and frequency/intensity tables (GaussianStageMeta::opt_energy_trajectory and ::frequency_modes). The TUI analysis panel reads these fields directly; keep the arrays small by truncating on display rather than during parsing.
NWChem summaries mirror the Gaussian contract with NwchemTaskMeta::opt_energy_trajectory and ::frequency_modes. Frequency tasks are enriched with ThermochemistryData, and Raman analyses are detected as their own NwchemTaskKind::Raman. The canonical builder (and direct summary parser) look for .normal sidecars referenced by the log (Raman scattering data written to …) and store the parsed sticks/samples under NwchemTaskMeta::raman_spectrum. Downstream consumers (CLI, TUI, GUI) can turn that payload into exportable spectra via orbitron_ui_shell::helpers::raman_spectrum_to_ir. Use the same pattern whenever a format exposes auxiliary files (e.g., DIRAC, GAMESS) so helpers stay reusable and UI panels can rely on typed metadata rather than ad-hoc JSON.
NWChem Task Outcome Detection: Each NwchemTaskSummary includes a RunOutcome field (Success, Incomplete, Failed, or Unknown) that tracks whether the task completed normally. The parser detects incomplete tasks by looking for expected termination markers (“Task times”) and checking for error conditions. Tasks have three key methods:
- is_complete() returns true only if outcome == RunOutcome::Success.
- has_usable_data() returns true if the task contains expected data for its type (frames for optimizations, modes for frequencies, energies for single-points) regardless of completion status.
- selection_priority() returns a priority score (Optimization=3, Frequency=2, SinglePoint=1) used by auto-selection logic to prefer more informative task types.
The viewer’s auto_queue_latest_tasks() function (ui/shell/src/tasks.rs:269) automatically selects the best complete task when loading NWChem files using find_best_complete_nwchem_task(), which prefers tasks with higher priority scores and breaks ties by choosing later tasks. If no complete tasks exist, the function logs a status message and shows a toast notification without attempting to load data.
The Analysis → Overview panel (ui/shell/src/panels/analysis/overview.rs:321-367) renders colored status circles next to each task: green (Success), yellow (Incomplete), red (Failed), gray (Unknown). Users can manually click incomplete tasks to load partial data; the viewer respects has_usable_data() to determine if a task is clickable.
When extending other format parsers to support per-task outcomes, follow the NwchemTaskSummary pattern: add an outcome field, implement is_complete() and has_usable_data() helpers, update the auto-selection logic, and add status indicators to the relevant UI panel.
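
The selection rules described above can be sketched as follows. The types are simplified stand-ins: TaskSummary, TaskKind, and the frames/modes/energies counters are assumptions for illustration, not the real NwchemTaskSummary definition.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RunOutcome {
    Success,
    Incomplete,
    Failed,
    Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskKind {
    Optimization,
    Frequency,
    SinglePoint,
}

/// Simplified stand-in for a per-task summary (field names are assumptions).
#[derive(Debug, Clone)]
struct TaskSummary {
    kind: TaskKind,
    outcome: RunOutcome,
    frames: usize,
    modes: usize,
    energies: usize,
}

impl TaskSummary {
    fn is_complete(&self) -> bool {
        self.outcome == RunOutcome::Success
    }

    /// True when the task holds the data expected for its kind,
    /// regardless of whether it terminated normally.
    fn has_usable_data(&self) -> bool {
        match self.kind {
            TaskKind::Optimization => self.frames > 0,
            TaskKind::Frequency => self.modes > 0,
            TaskKind::SinglePoint => self.energies > 0,
        }
    }

    fn selection_priority(&self) -> u8 {
        match self.kind {
            TaskKind::Optimization => 3,
            TaskKind::Frequency => 2,
            TaskKind::SinglePoint => 1,
        }
    }
}

/// Index of the best complete task: highest priority, ties go to the later task.
fn find_best_complete_task(tasks: &[TaskSummary]) -> Option<usize> {
    tasks
        .iter()
        .enumerate()
        .filter(|(_, t)| t.is_complete())
        .max_by_key(|(i, t)| (t.selection_priority(), *i))
        .map(|(i, _)| i)
}
```

Including the index in the sort key makes the "later task wins" tie-break explicit instead of relying on iterator ordering.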
When introducing new metadata, update automation/tui/src/panels.rs (analysis view) and the GUI equivalent (ui/shell/src/panels/analysis_tasks.rs) so both experiences stay in sync. Unit tests should cover the shape and ordering of the presented data to catch regressions early.

6.4 Parsing Utilities Reference

The io/pipelines crate provides a set of parsing utilities in parsing_utils/ that eliminate code duplication and improve consistency across all format parsers. These utilities are covered by 116 unit tests and are used throughout the format parsers; the file counts below show how widely each one is adopted.

Key utilities (by adoption):

parse_whitespace_tokens (31 files) – Zero-copy tokenization of whitespace-separated values. Returns an iterator over &str slices, avoiding allocations for performance-critical parsing loops.
contains_any_marker (10 files) – Multi-pattern string matching for format detection and section identification. Accepts a slice of marker strings and returns true if any are found.
convert_bohr_to_angstrom (8 files) – Unit conversion for atomic units. Handles both single values and coordinate triples with a consistent conversion factor (0.529177210903).
parse_float_after_delimiter (6 files) – Extracts floating-point values from “key = value” patterns. Handles scientific notation (including Fortran D-format) and returns Option<f64>.
parse_scientific_float (7 files) – Robust float parsing with automatic Fortran D-format conversion (1.23D+05 → 1.23E+05). Returns Option<f64> instead of panicking.
parse_coordinate_triple (4 files) – Parses three consecutive floats from whitespace-separated tokens into [f64; 3]. Essential for geometry parsing across all molecular formats.
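
To make the Fortran D-format handling concrete, here is a minimal sketch of the two float helpers. The function names come from the list above, but the bodies and exact signatures are assumptions written for illustration; the real implementations in parsing_utils/ may differ.

```rust
/// Sketch of Fortran-aware float parsing: "1.23D+05" is read as 1.23E+05.
/// Returns None instead of panicking on malformed input.
fn parse_scientific_float(token: &str) -> Option<f64> {
    let normalized: String = token
        .trim()
        .chars()
        // Fortran writes double-precision exponents with D or d.
        .map(|c| if c == 'D' || c == 'd' { 'E' } else { c })
        .collect();
    normalized.parse::<f64>().ok()
}

/// Sketch of "key = value" extraction: parses the first token after `delimiter`.
fn parse_float_after_delimiter(line: &str, delimiter: char) -> Option<f64> {
    let (_, rest) = line.split_once(delimiter)?;
    parse_scientific_float(rest.split_whitespace().next()?)
}
```

Routing the delimiter helper through the scientific-float helper is one way to get D-format support in both without duplicating the conversion.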

Additional utilities:

skip_empty_lines (5 files) – Advances an iterator past blank lines
parse_last_float (3 files) – Extracts the last float from a line
parse_float_at_index (1 file) – Gets a float from a specific token index
extract_element_symbol (2 files) – Normalizes atomic symbols (e.g., CA → Ca)
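
Two of the smaller helpers can be sketched as below. The bodies are assumptions for illustration: in particular, this extract_element_symbol only fixes capitalisation of the leading one or two letters and does not validate against the periodic table, which the real helper may well do.

```rust
/// Sketch: normalise an atomic symbol to "Xx" capitalisation ("CA" -> "Ca").
/// Assumption: only the leading alphabetic run is kept, up to two letters.
fn extract_element_symbol(raw: &str) -> Option<String> {
    let letters: Vec<char> = raw
        .trim()
        .chars()
        .take_while(|c| c.is_ascii_alphabetic())
        .take(2)
        .collect();
    let (first, rest) = letters.split_first()?;
    let mut symbol = String::new();
    symbol.push(first.to_ascii_uppercase());
    symbol.extend(rest.iter().map(|c| c.to_ascii_lowercase()));
    Some(symbol)
}

/// Sketch: the last whitespace-separated token of a line, parsed as a float.
fn parse_last_float(line: &str) -> Option<f64> {
    line.split_whitespace().next_back()?.parse::<f64>().ok()
}
```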

When to use parsing utilities:

Always use parse_whitespace_tokens instead of .split_whitespace().collect::<Vec<_>>() for tokenization – it returns an iterator over borrowed &str slices, so no intermediate Vec is allocated.
Always use parse_scientific_float or parse_float_after_delimiter instead of .parse::<f64>() when parsing floats from quantum chemistry logs – they handle Fortran D-format automatically.
Always use convert_bohr_to_angstrom for unit conversions instead of inline multiplication – it ensures consistent precision across the codebase.
Use parse_coordinate_triple when extracting XYZ coordinates from whitespace-separated tokens (common in XYZ, PDB, log files).
Use contains_any_marker for format detection and section boundaries instead of chaining .contains() calls.
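
The unit-conversion and marker-matching guidance can be illustrated with the sketch below. The conversion factor is the one quoted earlier in this section; the function bodies, the separate triple helper, and the (line, markers) argument order are assumptions for this example.

```rust
/// Bohr radius in Angstrom, as quoted for convert_bohr_to_angstrom.
const BOHR_TO_ANGSTROM: f64 = 0.529177210903;

/// Converts a single length from Bohr to Angstrom.
fn convert_bohr_to_angstrom(value: f64) -> f64 {
    value * BOHR_TO_ANGSTROM
}

/// Converts a coordinate triple; the name is hypothetical.
fn convert_bohr_triple_to_angstrom(xyz: [f64; 3]) -> [f64; 3] {
    xyz.map(convert_bohr_to_angstrom)
}

/// Returns true if `line` contains any of the marker strings.
fn contains_any_marker(line: &str, markers: &[&str]) -> bool {
    markers.iter().any(|m| line.contains(*m))
}
```

Keeping the factor in one constant is what gives the "consistent precision" benefit: no parser can accidentally use a truncated value such as 0.5292.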

Example migration patterns:

// Before: Manual tokenization and parsing
let parts: Vec<&str> = line.split_whitespace().collect();
if parts.len() >= 4 {
    let x = parts[1].parse::<f64>().ok()?;
    let y = parts[2].parse::<f64>().ok()?;
    let z = parts[3].parse::<f64>().ok()?;
    // ...
}

// After: Using utilities
use crate::parsing_utils::*;
let mut tokens = parse_whitespace_tokens(line);
tokens.next()?; // skip element
let coords = parse_coordinate_triple(&mut tokens)?;

// Before: Nested delimiter searches
if let Some(eq_pos) = line.find('=') {
    if let Ok(value) = line[eq_pos + 1..].trim().parse::<f64>() {
        energy = value;
    }
}

// After: Single utility call
use crate::parsing_utils::parse_float_after_delimiter;
if let Some(value) = parse_float_after_delimiter(line, '=') {
    energy = value;
}
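
For readers who want to run the first "After" pattern in isolation, the following sketch supplies stand-in versions of the two helpers it calls. These bodies are assumptions (the real utilities also handle Fortran D-format, which this stand-in omits), and parse_atom_line is a hypothetical wrapper added for the example.

```rust
/// Stand-in tokenizer: an iterator over borrowed slices, no Vec allocated.
fn parse_whitespace_tokens(line: &str) -> impl Iterator<Item = &str> {
    line.split_whitespace()
}

/// Stand-in triple parser: consumes the next three tokens as floats.
fn parse_coordinate_triple<'a, I>(tokens: &mut I) -> Option<[f64; 3]>
where
    I: Iterator<Item = &'a str>,
{
    let mut out = [0.0; 3];
    for slot in out.iter_mut() {
        *slot = tokens.next()?.parse::<f64>().ok()?;
    }
    Some(out)
}

/// Hypothetical wrapper: element symbol followed by x, y, z.
fn parse_atom_line(line: &str) -> Option<(&str, [f64; 3])> {
    let mut tokens = parse_whitespace_tokens(line);
    let element = tokens.next()?;
    let coords = parse_coordinate_triple(&mut tokens)?;
    Some((element, coords))
}
```

Taking the iterator by mutable reference lets the caller keep reading any trailing tokens (charges, flags) after the triple has been consumed.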

Adding a new format (updated workflow):

When implementing a new parser, follow §6.1 above, but also:

Import utilities at the top of your module:
```
use crate::parsing_utils::*;
```
Use parse_whitespace_tokens for all tokenization tasks
Use parse_float_after_delimiter for “key = value” patterns
Use parse_coordinate_triple for geometry parsing
Use convert_bohr_to_angstrom for atomic unit conversions
Use contains_any_marker for section detection

Quality standards:

All utilities return Option<T> or Result<T> (no panics)
Fortran D-format scientific notation is handled automatically
Zero-copy operations use &str slices where possible
116 unit tests verify edge cases
Generic over AsRef<str> for caller flexibility

When to create a new utility:

Consider adding a new utility to parsing_utils/ when:
- The pattern appears 3+ times across different format parsers
- The logic is non-trivial (e.g., requires format conversion or bounds checking)
- Edge case handling is important (scientific notation, whitespace variations)
- Consistency is critical (unit conversions, coordinate parsing)

For detailed documentation, see the module-level rustdoc on these utilities in io/pipelines.

6.5 Format Capabilities Reference

Orbitron supports 13+ file formats spanning structural data, quantum chemistry calculations, periodic systems, and volumetric grids. This section provides a comprehensive reference for each format’s capabilities, limitations, and implementation details.

Format Capabilities Matrix

Format	Extensions	Structure	Trajectory	Frequency	Orbitals	Periodic	Thermo	Streaming
XYZ	`.xyz`	✓	✓	✗	✗	✗	✗	✓
PDB	`.pdb`, `.ent`, `.brk`	✓	✗	✗	✗	✓	✗	✓
CIF (incl. mmCIF)	`.cif`	✓	✗	✗	✗	✓	✗	✓
SDF	`.sdf`, `.mol`	✓	✗	✗	✗	✗	✗	✗
VASP	POSCAR, CONTCAR, `vasprun.xml`	✓	✗	✗	✗	✓	✗	✓
Gaussian	`.log`, `.out`, `.gjf`, `.fchk`, `.cube`	✓	✓	✓	✓	✗	✓	✓
NWChem	`.out`, `.nw`	✓	✓	✓	✓	✗	✓	✓
CUBE	`.cube`, `.cub`, `.xsf`	✓	✗	✗	✓	✗	✗	✓
NBO	`.nbo`, FILE47	✓	✗	✗	✓	✗	✗	✗
DIRAC	`.out`, `.h5`	✓	✗	✗	✓	✗	✗	✓
Molpro	(output), `.xml`	✓	✗	✓	✗	✗	✓	✓
Molcas	(output)	✓	✗	✓	✗	✗	✗	✓
Quantum ESPRESSO	`.out`, `.in`, `.xml`, `.xsf`, `.dat`, `.dos`, `.pdos_*`, `.bands`, `.bands.gnu`, `.UPF`	✓	✗	✓	✗	✓	✗	✓

XYZ (`.xyz`)

Capabilities:
- Multi-frame trajectory support (native format)
- Extended XYZ with per-atom charges, velocities, and forces (parsed from comment line)
- Comment line metadata (labels, step numbers, energies)
- Automatic element detection from symbols or atomic numbers
- Bond inference via SceneBuilder::infer_covalent_bonds

Implementation: io/pipelines/src/formats/xyz/mod.rs
- Parser: Line-by-line streaming with parse_whitespace_tokens
- Detection: “atom count” header + element symbols + Cartesian coordinates
- Canonical: Structure section + optional trajectory positions attachment
- Streaming: ✓ (frame-by-frame for large trajectories)

Limitations:
- No explicit bond information (all bonds inferred)
- No periodic boundary conditions
- Element types must be consistent across all frames
- Comment line metadata format is not standardized

Test coverage: 27 fixtures (io/pipelines/tests/fixtures/xyz/)
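
The frame layout that makes XYZ a natural trajectory format (atom count, comment line, then one line per atom, repeated) can be sketched as a small reader. XyzFrame and parse_xyz_frames are hypothetical names; the real parser streams frames and also handles extended-XYZ metadata, which this sketch ignores.

```rust
#[derive(Debug, PartialEq)]
struct XyzFrame {
    comment: String,
    atoms: Vec<(String, [f64; 3])>,
}

/// Minimal multi-frame XYZ reader: atom count, comment line, then N atom lines.
fn parse_xyz_frames(text: &str) -> Result<Vec<XyzFrame>, String> {
    let mut lines = text.lines().peekable();
    let mut frames = Vec::new();
    loop {
        // Skip blank lines between frames.
        while matches!(lines.peek(), Some(l) if l.trim().is_empty()) {
            lines.next();
        }
        let header = match lines.next() {
            Some(h) => h,
            None => break, // end of input
        };
        let count: usize = header
            .trim()
            .parse()
            .map_err(|_| format!("bad atom count: {header:?}"))?;
        let comment = lines.next().ok_or("missing comment line")?.to_string();
        let mut atoms = Vec::with_capacity(count);
        for _ in 0..count {
            let line = lines.next().ok_or("truncated frame")?;
            let mut tok = line.split_whitespace();
            let symbol = tok.next().ok_or("missing element")?.to_string();
            let mut xyz = [0.0; 3];
            for v in xyz.iter_mut() {
                *v = tok
                    .next()
                    .and_then(|t| t.parse::<f64>().ok())
                    .ok_or_else(|| format!("bad coordinate in {line:?}"))?;
            }
            atoms.push((symbol, xyz));
        }
        frames.push(XyzFrame { comment, atoms });
    }
    Ok(frames)
}
```

A truncated final frame is reported as an error here; a streaming loader might instead keep the complete frames and flag the file as incomplete.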

PDB (`.pdb`, `.ent`, `.brk`, `.pdb1`)

Capabilities:
- Biological macromolecule structures (proteins, DNA/RNA, ligands)
- CRYST1 unit cell parameters for crystallographic data
- CONECT records for explicit bond connectivity
- Secondary structure annotations (SSBOND, HELIX, SHEET)
- Residue, chain, and atom numbering
- B-factors (temperature factors) and occupancy values
- Multi-model files (NMR ensembles) parsed as separate structures

Implementation: io/pipelines/src/formats/pdb/mod.rs
- Parser: Record-based with fixed column positions
- Detection: “ATOM”/“HETATM” records + PDB column layout
- Canonical: Structure section with residue/chain metadata, CONECT bonds
- Streaming: ✓ (line-by-line, single-pass)

Limitations:
- Single frame only (multi-model files require separate loads)
- No quantum mechanical properties (energies, orbitals, frequencies)
- Bond connectivity is optional (CONECT records not always present)
- Fixed-width format can be fragile with malformed files

Test coverage: 7 fixtures including proteins, DNA, metal clusters
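
Fixed-column parsing of ATOM/HETATM records can be sketched as below, using the standard PDB column layout (name 13-16, residue 18-20, chain 22, sequence number 23-26, x/y/z in 31-38, 39-46, 47-54). PdbAtom, cols, and parse_atom_record are hypothetical names; the real parser extracts more fields and is not reproduced here.

```rust
#[derive(Debug, PartialEq)]
struct PdbAtom {
    name: String,
    res_name: String,
    chain: char,
    res_seq: i32,
    position: [f64; 3],
}

/// Trimmed text in 1-based, inclusive columns `start..=end`.
/// Returns None (rather than panicking) when the line is too short.
fn cols(line: &str, start: usize, end: usize) -> Option<&str> {
    line.get(start - 1..end.min(line.len())).map(str::trim)
}

/// Parses one ATOM or HETATM record; any other record yields None.
fn parse_atom_record(line: &str) -> Option<PdbAtom> {
    let record = cols(line, 1, 6)?;
    if record != "ATOM" && record != "HETATM" {
        return None;
    }
    Some(PdbAtom {
        name: cols(line, 13, 16)?.to_string(),
        res_name: cols(line, 18, 20)?.to_string(),
        chain: cols(line, 22, 22)?.chars().next().unwrap_or(' '),
        res_seq: cols(line, 23, 26)?.parse::<i32>().ok()?,
        position: [
            cols(line, 31, 38)?.parse::<f64>().ok()?,
            cols(line, 39, 46)?.parse::<f64>().ok()?,
            cols(line, 47, 54)?.parse::<f64>().ok()?,
        ],
    })
}
```

Using checked slicing for every column is one way to address the fragility noted under Limitations: a malformed or truncated line produces None instead of a panic.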

CIF (`.cif`, including mmCIF)

Capabilities:
- Crystallographic Information File (IUCr standard).
- Unit cell parameters (a, b, c, α, β, γ).
- Fractional and Cartesian atomic coordinates.
- Space group symmetry operations (stored under crystal:symops for the Edit→Cell→Apply symmetry workflow).
- Atomic occupancy and anisotropic displacement parameters (U_11..U_23).
- mmCIF support for macromolecular structures: chain / residue / insertion-code metadata on _atom_site rows; biological assemblies via _pdbx_struct_assembly_gen + _pdbx_struct_oper_list; secondary-structure ribbons via _struct_conf (helix) and _struct_sheet_range (sheet) loops, written to the same pdb:helix_ranges / pdb:sheet_strands metadata tags the PDB parser produces so the existing ribbon renderer (viewer/core/src/ribbon/pass1/ss.rs) lights up unchanged.
- Bond inference from unit cell + atomic radii.

Implementation: io/pipelines/src/formats/cif/mod.rs
- Parser: CIF data-block parser with loop/value extraction (parse/mod.rs), per-section sub-modules (atoms.rs, cell.rs, assembly.rs, secondary.rs, helpers.rs).
- Detection: data_ blocks + _cell_length_a / _atom_site tags.
- Canonical: Periodic structure section with unit cell metadata + pdb:* tags for biological assemblies and ribbons.
- Streaming: ✓ (block-by-block).

Limitations:
- NMR multi-model ensembles (_atom_site.pdbx_PDB_model_num) load as a single scene rather than a Trajectory.
- _struct_conn (explicit bonds: disulfides, metal coordination, link records) is not parsed; bonds are inferred geometrically.
- _chem_comp residue-metadata definitions are not consulted for non-standard residues / ligands.
- Complex loops may skip unrecognized tags silently.
- Symmetry operations are recorded but not applied to generate the full unit cell automatically (use Edit → Cell → Apply symmetry).

Test coverage: 6 crystallography fixtures (diamond, NaCl, silicon, perovskites, benzene) plus the parses_mmcif_secondary_structure_into_ribbon_tags integration test for biology mmCIF helix / sheet loops.
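
Since CIF files usually give fractional coordinates alongside the six cell parameters, a loader has to convert them to Cartesian positions. The sketch below shows the standard conversion with a along x and b in the xy plane. The Cell type and method names are hypothetical, and the axis convention is an assumption; the crate's cell.rs may orient the lattice differently.

```rust
/// Unit-cell lengths (Angstrom) and angles (degrees), as read from
/// _cell_length_* and _cell_angle_* tags.
struct Cell {
    a: f64,
    b: f64,
    c: f64,
    alpha: f64,
    beta: f64,
    gamma: f64,
}

impl Cell {
    /// Lattice vectors as rows, with a along x and b in the xy plane.
    fn lattice_vectors(&self) -> [[f64; 3]; 3] {
        let ca = self.alpha.to_radians().cos();
        let cb = self.beta.to_radians().cos();
        let cg = self.gamma.to_radians().cos();
        let sg = self.gamma.to_radians().sin();
        let cx = self.c * cb;
        let cy = self.c * (ca - cb * cg) / sg;
        let cz = (self.c * self.c - cx * cx - cy * cy).sqrt();
        [
            [self.a, 0.0, 0.0],
            [self.b * cg, self.b * sg, 0.0],
            [cx, cy, cz],
        ]
    }

    /// Cartesian position = f[0]*a_vec + f[1]*b_vec + f[2]*c_vec.
    fn fractional_to_cartesian(&self, f: [f64; 3]) -> [f64; 3] {
        let m = self.lattice_vectors();
        let mut out = [0.0; 3];
        for axis in 0..3 {
            out[axis] = f[0] * m[0][axis] + f[1] * m[1][axis] + f[2] * m[2][axis];
        }
        out
    }
}
```

For a cubic cell with a = 4 Å, the fractional point (0.5, 0.5, 0.5) lands at approximately (2, 2, 2) Å, up to floating-point rounding in the trigonometric terms.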

SDF/MOL (`.sdf`, `.mol`)

Capabilities:
- MDL Molfile V2000 format
- Explicit bond connectivity with bond orders (single/double/triple/aromatic)
- Connection table with 3D coordinates
- Property data blocks (SD file format)
- Formal charges and radical flags

Implementation: io/pipelines/src/formats/sdf/mod.rs
- Parser: Fixed-format counts line + atom/bond blocks
- Detection: “V2000” tag + connection table structure
- Canonical: Structure section with explicit bonds + property metadata
- Streaming: ✗ (full file read required)

Limitations:
- V3000 format not implemented
- No 3D property fields (QM energies, charges)
- Formal charge encoding is limited to ±3
- Large SD files (thousands of molecules) load slowly without streaming

Test coverage: 6 fixtures (aspirin, benzene, ethanol, combined samples)
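
Two pieces of the V2000 layout are easy to get wrong: the fixed-width counts line and the atom-block charge code, which is why formal charges are limited to ±3. The sketch below shows both under the V2000 conventions; the function names are hypothetical and the real parser in sdf/ is not reproduced here.

```rust
/// Parses atom and bond counts from a V2000 counts line
/// (atom count in columns 1-3, bond count in columns 4-6).
fn parse_counts_line(line: &str) -> Option<(usize, usize)> {
    if !line.contains("V2000") {
        return None;
    }
    let atoms = line.get(0..3)?.trim().parse::<usize>().ok()?;
    let bonds = line.get(3..6)?.trim().parse::<usize>().ok()?;
    Some((atoms, bonds))
}

/// Decodes the atom-block charge code of the V2000 format.
/// Code 4 marks a doublet radical rather than a charge, so it maps to None,
/// as does any code outside 0..=7.
fn decode_v2000_charge(code: u8) -> Option<i8> {
    match code {
        0 => Some(0),
        1 => Some(3),
        2 => Some(2),
        3 => Some(1),
        5 => Some(-1),
        6 => Some(-2),
        7 => Some(-3),
        _ => None,
    }
}
```

Slicing by column rather than splitting on whitespace matters here because counts of 100 or more leave no space between the two fields.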

VASP (POSCAR, CONTCAR, `vasprun.xml`)

Capabilities:
- Periodic solid-state structures with lattice vectors
- Direct (fractional) and Cartesian coordinate modes; first-letter VASP shorthand (c/C/k/K = Cartesian, d/D/f/F = Direct, s/S = Selective dynamics) honoured
- VASP-4.x POSCARs (no species line) fall back to a sibling POTCAR for element identity, with PAW pseudopotential suffix stripping (Si_d_GW → Si)
- Inline-comment scale lines (0.52918 ! scaling parameter)
- vasprun.xml: Total density of states (DOS), band structure, Fermi energy, per-atom forces (final ionic step) via parse_vasprun_final_forces
- DOS/band structure export to CSV + PNG plots
- Volumetric: CHGCAR / CHG / PARCHG parsed by parse_chgcar (in chgcar.rs); spin-polarized runs surface as two VolumetricData blocks (total + spin density). Routed through volumetric_loader, so CHGCAR is served by the same Surfaces pipeline that cube files use.
- OUTCAR parser (outcar.rs::parse_outcar) extracts last-step TOTAL-FORCE (eV/Angst) block + magnetization (x) per-atom moments. Drives D4 force arrows + D5 magmom halo.
- XDATCAR parser (xdatcar.rs::parse_xdatcar) returns multi-frame Trajectory (constant-cell only; variable-cell NPT uses first-frame lattice).
- ACF.dat Bader output (bader.rs::parse_acf_dat) returns per-atom electron populations; net charge Z − population feeds the Bader halo (AtomColorScheme::VaspBaderCharges).
- Sources sub-tab (sources_runtime.rs + viewer/core/src/ui_state/sources/vasp.rs) auto-detects all VASP filenames and surfaces the run’s directory contents with Loaded/Detected/Missing badges. POSCAR ↔︎ CONTCAR rows expose a Compare button that pushes the sibling as a scene overlay for relaxation diffs.
- Cell-conversion edit commands: LinearCellTransformCommand (primitive ↔︎ conventional for cF/cI/oF/oI/tI/hR), CleaveSlabHklCommand ((hkl) slab cut + vacuum padding), and the existing supercell / Niggli / wrap operations.

Implementation: io/pipelines/src/formats/vasp/mod.rs
- Parsers: poscar.rs, parse.rs (vasprun.xml DOM/streaming), chgcar.rs, outcar.rs, xdatcar.rs, bader.rs, doscar.rs, procar.rs
- Detection: POSCAR (lattice vectors + scaling + element list), XML (“vasprun” root), VASP filename match for extensionless files
- Canonical: Periodic structure + electronic structure attachment (bands/DOS)
- Streaming: ✓ (XML only; POSCAR/CHGCAR/OUTCAR/XDATCAR are single-pass)
- Auto-collection of companion files (DOSCAR / PROCAR / OUTCAR / ACF.dat / vasprun fallback for forces) lives in ui/shell/src/viewer_loop/background_events/vasp.rs::hydrate_vasp_bundle_artifacts and runs whenever the active scene’s parent directory holds matching files.

Removed in 2026-05: the .zip/.tar.gz/.tgz archive-import flow (session/bundles.rs::extract_zip_bundle / extract_tar_bundle / select_bundle_entry, plus BundleUiState::bundle_mounts / current_bundle_root / last_vasp_bundle_scan). VASP runs now open by selecting any canonical file directly; companions resolve via Sources. zip / tar / flate2 were dropped as ui-shell dependencies.

Limitations:
- POSCAR/CONTCAR: single structure only (XDATCAR covers trajectories instead)
- Projected DOS parsing is limited compared to QE
- Electronic structure requires vasprun.xml (not available from POSCAR alone)
- Large XML files (1+ GB) require significant memory for DOM parsing
- Per-atom magmom from vasprun.xml requires integrating spin-decomposed partial DOS (not implemented; OUTCAR is the canonical source)
- XDATCAR variable-cell NPT runs use first-frame lattice for all frames

Test coverage: unit tests in each parser file + integration tests in io/pipelines/tests/vasp.rs (corpus-gated qm_tests_corpus_sanity walks every directory under ~/Desktop/qm_tests/vasp/, asserts every POSCAR loads cleanly).
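
Three of the POSCAR conveniences listed under Capabilities (first-letter mode shorthand, PAW suffix stripping, and inline-comment scale lines) can be sketched as below. All names are hypothetical and the bodies are illustrations of the stated rules, not the code in poscar.rs.

```rust
#[derive(Debug, PartialEq, Eq)]
enum PoscarMode {
    SelectiveDynamics,
    Cartesian,
    Direct,
}

/// Classifies a POSCAR mode line by its first non-blank letter.
fn classify_mode_line(line: &str) -> Option<PoscarMode> {
    match line.trim_start().chars().next()? {
        's' | 'S' => Some(PoscarMode::SelectiveDynamics),
        'c' | 'C' | 'k' | 'K' => Some(PoscarMode::Cartesian),
        'd' | 'D' | 'f' | 'F' => Some(PoscarMode::Direct),
        _ => None,
    }
}

/// Strips a pseudopotential suffix from a POTCAR label ("Si_d_GW" -> "Si").
fn strip_paw_suffix(label: &str) -> &str {
    label.split('_').next().unwrap_or(label)
}

/// Parses a scale line that may carry an inline comment
/// ("0.52918 ! scaling parameter" -> 0.52918).
fn parse_scale_line(line: &str) -> Option<f64> {
    line.split_whitespace().next()?.parse::<f64>().ok()
}
```

Reading only the first token of the scale line is what makes trailing comments harmless, whatever comment character the file's author used.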

Gaussian (`.log`, `.out`, `.gjf`, `.com`, `.cube`, `.fchk`)

Capabilities: - Multi-stage jobs (Link1) with stage boundaries - Optimization trajectories (geometry steps + energies) - Vibrational frequencies with IR/Raman intensities - Molecular orbitals (CUBE files, formatted checkpoint data) - Population analysis (Mulliken, Löwdin, Natural/NBO if available) - Electronic excited states (TD-DFT, CIS) - Method and basis set extraction - SCF convergence details - Thermochemistry data (zero-point energy, enthalpy, free energy)

Implementation: io/pipelines/src/formats/gaussian/mod.rs - Parser: Line-by-line with section markers (“Standard orientation”, “Frequencies”, etc.) - Detection: “Gaussian” header + “Copyright” line - Canonical: Multi-stage summary with attachments for MO coefficients, trajectory positions - Streaming: ✓ (stage boundary loaders: gaussian_stage_scene_by_boundary)

Limitations:
- Complex multi-stage parsing requires robust stage detection
- FCHK support is partial (geometry + basis only)
- Checkpoint files (.chk) require formchk preprocessing
- Some population methods (Hirshfeld, CHELPG) not fully extracted

Test coverage: 30+ fixtures including optimization, frequency, TDDFT, NBO jobs

NWChem (`.out`, `.nw`, `.movecs`, `.hess`)

Capabilities:
- Multi-task detection (Optimization, Frequency, Raman, Single-Point, Property)
- Task-level outcome tracking (Success, Incomplete, Failed, Unknown)
- Optimization trajectories with per-step energies
- Vibrational frequencies + IR intensities
- Raman spectrum parsing from .normal sidecar files
- Molecular orbital coefficients (truncated from .out’s top-N table; full nbf × nmo from .movecs)
- Basis-set definition parsed from the Basis "ao basis" printout into a GaussianBasisSet (Cartesian d/f only; pure-spherical d not yet permuted)
- Population analysis (Mulliken, Löwdin, Natural)
- TDDFT excited states
- Thermochemistry data
- Task byte boundaries for efficient random-access loading
- Cartesian Hessian (.hess) parser + Jacobi mode synthesis (mass-weighted projection, 5-6 rigid-body modes removed); turns a stand-alone .hess into FrequencyData even without an .out freq block
- Input deck (.nw) structured summary: charge, nopen, basis libraries, ECP, xc, task list

Implementation: io/pipelines/src/formats/nwchem/mod.rs
- Parser: Modular task scanner + per-task metadata extraction
- Detection: “Northwest Computational Chemistry Package” or “NWChem” header
- Canonical: Task summaries with outcome + attachments (MO coefficients, trajectory shards, Raman spectra)
- Streaming: ✓ (task boundary loaders: nwchem_task_scene_by_boundary, nwchem_task_trajectory_by_boundary)
- Basis attachment: parse_basis_set + build_gaussian_basis in basis_set.rs. The trajectory loader parses the basis once for the whole task and stamps it onto every frame’s electronic_structure, including frame 0 (which is what the viewer displays as the active scene).
- Movecs: parse_movecs in movecs.rs handles Fortran unformatted records with auto-detected i32/i64 integer width.
- Hessian: parse_hessian + frequencies_from_hessian in hessian.rs; uses the pure-Rust Jacobi solver in core/math/src/jacobi.rs.

Sources Load handlers (ui/shell/src/viewer_loop/runtime/redraw/panels/mod.rs):
- RoleId::NwchemHessian → handle_nwchem_hessian_load: synthesises FrequencyData and switches to Vibrations.
- RoleId::NwchemMovecs → handle_nwchem_movecs_load: attaches MOs as MolecularOrbital records (preserving atom-prefixed labels harvested from prior .out-parsed MOs so the halo overlay still works), populates GaussianBasisSet.mo_coefficients_alpha/beta when the basis is present (with a Cartesian d-shell column permutation xx,xy,xz,yy,yz,zz → FCHK xx,yy,zz,xy,xz,yz), and switches to Orbitals.
- RoleId::NwchemInput → toasts a deck summary; only swaps the active scene when none is loaded.
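The Cartesian d-shell reordering mentioned above (NWChem xx,xy,xz,yy,yz,zz to FCHK xx,yy,zz,xy,xz,yz) amounts to a fixed index table. The sketch below only reorders coefficients; the function names are hypothetical, and any normalisation differences between the two conventions are deliberately left out because the text above does not describe them.

```rust
/// For each FCHK slot (xx, yy, zz, xy, xz, yz), the index of the same
/// component in NWChem order (xx, xy, xz, yy, yz, zz).
const NWCHEM_TO_FCHK_CART_D: [usize; 6] = [0, 3, 5, 1, 2, 4];

/// Reorder the six Cartesian d coefficients of one shell for one MO.
pub fn permute_cart_d(nwchem: [f64; 6]) -> [f64; 6] {
    let mut fchk = [0.0; 6];
    for (slot, &src) in NWCHEM_TO_FCHK_CART_D.iter().enumerate() {
        fchk[slot] = nwchem[src];
    }
    fchk
}

/// Apply the permutation in place to every d shell of one MO column.
/// `d_shell_starts` holds the index of the first function of each d shell;
/// the function panics if a shell runs past the end of `coeffs`.
pub fn permute_d_shells(coeffs: &mut [f64], d_shell_starts: &[usize]) {
    for &start in d_shell_starts {
        let mut shell = [0.0; 6];
        shell.copy_from_slice(&coeffs[start..start + 6]);
        coeffs[start..start + 6].copy_from_slice(&permute_cart_d(shell));
    }
}
```

Keeping the mapping in a single constant makes it straightforward to add a second table for pure-spherical shells later, which is the gap listed under Limitations.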

Limitations:
- Complex output format with many task types (some partially supported)
- TDDFT features limited to basic excitation energies
- Periodic DFT (plane-wave) outputs not fully parsed
- Some advanced property analyses (response, NMR) have limited extraction
- Pure-spherical d/f from NWChem (uncommon; 6-31G*, def2 default to Cartesian) needs a separate column permutation table; today it falls back to fchk_conversion’s 5D→6D mapping which expects Gaussian’s pure-d ordering, not NWChem’s
- .zmat and .civecs files are detected but not yet parsed; .civecs would unlock CI/MCSCF excited-state densities

Test coverage: Test corpus with energy, optimization, frequency, Raman jobs; .movecs corpus walks ~/Desktop/qm_tests/nwchem/ cleanly across 21 binary files; .hess parses CO2 and ammonium fixtures; basis-set attachment verified end-to-end on ammonium (NH4+ in 6-31G* → 23 nbf, 14 shells over 1 N + 4 H).

Key Feature: Per-task outcome detection (NwchemTaskSummary::outcome) allows smart auto-selection and UI status indicators (see §6.3).
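To illustrate how a per-task outcome can drive auto-selection and status icons, here is a small sketch. The four outcome variants come from the capability list above; TaskSummary, auto_select, status_icon, and the selection policy itself (prefer the last successful task, otherwise fall back to the last task) are assumptions made for the example, not a description of NwchemTaskSummary or the viewer's actual rule.

```rust
/// Outcome variants as listed for NWChem task tracking.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskOutcome {
    Success,
    Incomplete,
    Failed,
    Unknown,
}

/// Hypothetical, trimmed-down task summary used only for this sketch.
pub struct TaskSummary {
    pub label: &'static str,
    pub outcome: TaskOutcome,
}

/// Assumed policy: show the last task that succeeded; if none did,
/// show the last task so the user sees where the run stopped.
pub fn auto_select(tasks: &[TaskSummary]) -> Option<usize> {
    tasks
        .iter()
        .rposition(|t| t.outcome == TaskOutcome::Success)
        .or_else(|| tasks.len().checked_sub(1))
}

/// Map an outcome to a short glyph for a status column.
pub fn status_icon(outcome: TaskOutcome) -> &'static str {
    match outcome {
        TaskOutcome::Success => "✓",
        TaskOutcome::Incomplete => "…",
        TaskOutcome::Failed => "✗",
        TaskOutcome::Unknown => "?",
    }
}
```

Because the outcome is a plain enum, the selection rule and the icon mapping stay independent: changing the policy does not touch the rendering code.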

CUBE (`.cube`, `.cub`, `.xsf`)

Capabilities:
- Volumetric grid data (molecular orbitals, electron density, electrostatic potential)
- 3D regular grid with origin, voxel vectors, and point data
- Atomic coordinates embedded in header
- Directory mode: multiple CUBE files with shared geometry validation
- Lazy grid loading for memory efficiency
- Marching cubes isosurface generation
- Program-agnostic format (Gaussian, NWChem, QE, ORCA, etc.)

Implementation: io/pipelines/src/formats/cube/mod.rs
- Parser: Header (atom count, origin, axes, atoms) + grid data block
- Detection: Atom count + origin + three axis vectors
- Canonical: Volumetric attachment + dataset metadata
- Streaming: ✓ (header parsed first, grid streamed separately)
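The header-first approach can be sketched as follows. The layout used is the conventional Gaussian cube header (two comment lines, one line with the atom count and origin, three lines with a voxel count and an axis vector each). CubeHeader and parse_cube_header are hypothetical names for this example, and the sketch ignores the extra meaning some writers attach to negative counts.

```rust
/// Grid description taken from the first six lines of a cube file.
#[derive(Debug, PartialEq)]
pub struct CubeHeader {
    pub atom_count: usize,
    pub origin: [f64; 3],
    pub dims: [usize; 3],
    pub axes: [[f64; 3]; 3],
}

/// Parse "<integer> <x> <y> <z>"; trailing fields are ignored.
fn parse_count_and_vec(line: &str) -> Option<(i64, [f64; 3])> {
    let mut fields = line.split_whitespace();
    let count: i64 = fields.next()?.parse().ok()?;
    let mut v = [0.0_f64; 3];
    for slot in v.iter_mut() {
        *slot = fields.next()?.parse().ok()?;
    }
    Some((count, v))
}

/// Returns None when the text does not look like a cube header, so the
/// same routine can double as a cheap detection probe.
pub fn parse_cube_header(text: &str) -> Option<CubeHeader> {
    // The first two lines are free-form comments.
    let mut lines = text.lines().skip(2);
    let (atoms, origin) = parse_count_and_vec(lines.next()?)?;
    let mut dims = [0usize; 3];
    let mut axes = [[0.0_f64; 3]; 3];
    for i in 0..3 {
        let (n, v) = parse_count_and_vec(lines.next()?)?;
        // Counts may be written with a sign; only the magnitude is kept here.
        dims[i] = n.unsigned_abs() as usize;
        axes[i] = v;
    }
    Some(CubeHeader {
        atom_count: atoms.unsigned_abs() as usize,
        origin,
        dims,
        axes,
    })
}
```

With the header in hand, the total number of grid values is dims[0] * dims[1] * dims[2], so a loader can decide whether to stream or defer the data block before reading any of it.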

Limitations:
- One dataset per file (no multi-grid CUBE format)
- Large grids (500³ points) require significant memory even with lazy loading
- No standard naming convention (HOMO.cube vs homo_001.cube varies by program)
- .xsf format treated identically to CUBE (QE-specific features ignored)

Test coverage: Multiple CUBE files for orbitals, density, potential

NBO (`.nbo`, FILE47)

Capabilities:
- NBO7 archive (FILE47) parsing for natural population analysis
- AO basis function metadata
- Natural orbital coefficients (from .37 plot files)
- Orbital labels and occupancies (from .46 files)
- Integration with Gaussian/NWChem for combined QM+NBO analysis
- Population tables (Natural charges, Wiberg bond indices)

Implementation: io/pipelines/src/formats/nbo/mod.rs
- Parser: FILE47 binary parser + associated text files
- Detection: “NBO” or “FILE47” markers in .nbo or .47 files
- Canonical: NBO summary with population extras + optional basis/geometry from FILE47
- Streaming: ✗ (requires full FILE47 read)

Limitations:
- Requires NBO7 output format (NBO6 and earlier not supported)
- Limited to NBO-specific data (no general QM properties)
- FILE47 sidecar must be present for basis/geometry reconstruction
- Orbital visualization requires separate CUBE generation

Test coverage: NBO fixtures with and without FILE47 sidecars

DIRAC (`.out`, `.h5`)

Capabilities:
- Relativistic quantum chemistry calculations (1-, 2-, and 4-component spinor formalism).
- TDDFT excited states with spin-orbit coupling.
- Gross population analysis (Mulliken).
- Symmetry-resolved orbitals.
- Task detection (SCF, DFT, RESOLVE, TDDFT).
- HDF5 checkpoint (.h5) reader: MO coefficients, eigenvalues, occupations, AO basis, molecule geometry. Quaternion-units nz (1 / 2 / 4) is honoured: for nz ≥ 2 the dominant-z slice is selected per MO so β-spinor-dominant orbitals (which leave z=0 near zero) are recovered correctly.
- Large-component AO basis reconstructed into a GaussianBasisSet so the Surfaces tab can render relativistic MOs as isosurfaces.

Implementation:
- io/pipelines/src/formats/dirac/mod.rs: task boundary scanner and metadata extraction for .out text outputs.
- dirac/checkpoint/: HDF5 checkpoint reader split into types.rs (public DiracCheckpointData), coefficients.rs (orbital-coefficient slicing with nz handling), aobasis.rs (Large-component basis → GaussianBasisSet), datasets.rs (HDF5 walk helpers), and labels.rs (basis-function labels).
- Detection: “DIRAC” / “Dirac” header lines for .out; HDF5 signature byte test for .h5.
- Canonical: Task summaries with relativistic metadata.
- Streaming: ✓ for .out; .h5 loaded fully (small).
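The per-MO slice selection described for nz ≥ 2 can be pictured with a short sketch. It assumes that "dominant" means the quaternion unit whose coefficient vector has the largest squared norm; that criterion, the coeffs[z][ao] layout, and the name dominant_slice are assumptions for illustration and may differ from what coefficients.rs does.

```rust
/// Pick the quaternion unit z whose coefficients carry the most weight
/// for one MO. `coeffs[z]` holds the real coefficients of unit z.
/// Ties go to the lowest z; an empty input yields None.
pub fn dominant_slice(coeffs: &[Vec<f64>]) -> Option<usize> {
    let mut best: Option<(usize, f64)> = None;
    for (z, slice) in coeffs.iter().enumerate() {
        // Squared norm of this slice.
        let weight: f64 = slice.iter().map(|c| c * c).sum();
        let better = match best {
            Some((_, w)) => weight > w,
            None => true,
        };
        if better {
            best = Some((z, weight));
        }
    }
    best.map(|(z, _)| z)
}
```

For an orbital whose z = 0 slice is close to zero, the function returns a later index, so the rendered coefficients are not an almost-empty vector. The other slices are dropped, which is why the phase information noted under Limitations is lost.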

Limitations:
- Specialized for relativistic methods (not general-purpose QM).
- Geometry optimization tracking is limited.
- Some advanced features (KRCI, Fock-space CC) have minimal support.
- Phase information from quaternion components z=1..nz-1 is discarded when extracting the dominant real-major slice.
- Small-component (/input/aobasis/2) basis is not used for isosurface rendering; chemistry-relevant orbital structure lives in Large.

Test coverage: DIRAC output fixtures for SCF, TDDFT, population tasks. .h5 real-fixture tests use the user’s qm_tests/dirac/ paths (gracefully skip on CI).

Molpro (output files, `.xml`)

Capabilities:
- Multi-reference methods (CASSCF, MRCI, CASPT2)
- Correlated calculations (CCSD, CCSD(T), MP2)
- Frequency analysis
- Task detection with program/method identification (RHF/UHF, CCSD, MULTI, OPTG, FREQ)
- XML sidecar parsing for extended metadata
- Thermochemistry data

Implementation: io/pipelines/src/formats/molpro/mod.rs
- Parser: Task scanner + XML manifest reader
- Detection: “Molpro” + version line, or XML root element
- Canonical: Task summaries with correlated energies + method metadata
- Streaming: ✓ (task boundary loaders)

Limitations:
- Complex multi-method outputs (some methods partially supported)
- Orbital extraction limited (no direct CUBE export)
- Some advanced features (explicit correlation, local methods) not fully parsed
- XML sidecar optional but required for full metadata

Test coverage: Molpro output fixtures for CCSD, CASSCF, optimization, frequency jobs

CLI Integration: orbitron inspect --molpro-task N and --molpro-kind freq filters for task-specific extraction.

Molcas/OpenMolcas (output files)

Capabilities:
- Multi-configurational methods (CASSCF, RASSCF)
- Perturbation theory (CASPT2, MS-CASPT2)
- Optimization metadata (energy profiles, gradient convergence)
- Frequency modes and thermochemistry
- Task detection per module (SCF, RASSCF, CASPT2, OPT, FREQ)
- Active space diagnostics (orbitals, spin, symmetry)

Implementation: io/pipelines/src/formats/molcas/mod.rs
- Parser: Module scanner + per-module metadata
- Detection: “Molcas” or “OpenMolcas” + module invocations
- Canonical: Task summaries with RASSCF/CASPT2 diagnostics (extras.molcas)
- Streaming: ✓ (module boundary loaders)

Limitations:
- Limited electronic structure details compared to Gaussian/NWChem
- Complex module structure (some modules partially supported)
- Gradients and Hessians not fully extracted

MO surfaces via MOLDEN: Molcas writes orbital coefficients to sibling *.scf.molden, *.rasscf.molden, *.guessorb.molden, and *.mp2.molden files. The shared MOLDEN parser (next section) ingests these and the Sources Load button on RoleId::MolcasMolden attaches a complete electronic_structure (atoms + basis + MOs) to the active scene — Surfaces can render orbitals immediately without a separate FCHK / movecs companion.

Test coverage: Molcas output fixtures for RASSCF, CASPT2, optimization, frequency; qm_tests_molden_corpus_sanity walks every .molden under ~/Desktop/qm_tests/molcas/ and parses 17+ orbital-bearing files cleanly across SCF, RASSCF, MP2, and Guess flavors, including PbO with 6 spherical f-shells (pbo.scf.molden).

MOLDEN (`*.molden`)

Capabilities:
- Portable text format emitted by Molpro, Molcas, ORCA, Turbomole, and many other QC packages
- Atoms (AU or Angstrom), basis-set definition, MO coefficients (alpha + beta)
- Spherical d/f/g flags ([5D], [7F], [9G]) with FCHK-compatible AO ordering
- Per-MO metadata: symmetry label, energy, spin, occupancy
- NBO MOLDEN dialect (atom-block headers with optional second integer flag)

Implementation: io/pipelines/src/formats/molden/mod.rs
- Parser: Section walker (SectionWalker) + per-section parsers (atoms / GTO / MO)
- Output: MoldenData { atoms, basis: GaussianBasisSet, mos: Vec<MoldenMo> }
- AO ordering matches fchk_conversion’s expected layout, so no shell-column permutation is needed (unlike NWChem’s .movecs which uses a different Cartesian d / f convention)

Sources Load handler (handle_molden_load in ui/shell/src/viewer_loop/runtime/redraw/panels/mod.rs): wires RoleId::MolproMolden and RoleId::MolcasMolden. Replaces the active scene’s basis with the MOLDEN basis (single source of truth), populates mo_coefficients_alpha/beta from the MO list, switches to the Orbitals tab, marks the row Loaded ✓.

Limitations:
- sp combined shells (Pople-style) not yet supported; Molcas decomposes them, so this is mostly an issue for older ORCA / Turbomole exports
- Geometry-only / frequency-only MOLDEN flavors (*.geo.molden, *.freq.molden, [GEOCONV] / [N_FREQ] sections) are valid MOLDEN but out of scope; the parser returns IoPipelineError::Parse if you feed one in directly. The corpus test skips these by checking for [GTO] first.

Test coverage: 3 unit tests (acrolein SCF, benzene SCF, header rejection) + corpus walker (qm_tests_molden_corpus_sanity).

Quantum ESPRESSO (`.out`, `.in`, `.xml`, `.xsf`, `.dat`, `.dos`, `.pdos_*`, `.bands`, `.bands.gnu`, `.UPF`)

Capabilities:
- Periodic DFT calculations (plane-wave basis)
- SCF / relax / nscf outputs from .out text files
- Input deck (.in) parsing: namelists &CONTROL / &SYSTEM, free-form ATOMIC_SPECIES / ATOMIC_POSITIONS / CELL_PARAMETERS blocks; lattice derivation for ibrav 0–14 (including centred orthorhombic, monoclinic, triclinic) plus alat / bohr / angstrom / crystal position units.
- Structured XML (<prefix>.xml / data-file-schema.xml) parsing via roxmltree: atomic_structure, band_structure (per-k-point eigenvalues + occupations), Fermi level, total energy, convergence status, exit status, lsda / noncolin / spinorbit flags, creator program/version.
- XSF reader handling both proper XSF (CRYSTAL / MOLECULE / ATOMS keyword files) and QE’s “Cube-as-xsf” flavour (pp.x output_format=6 writes Gaussian Cube content with an .xsf extension; this routes through the existing cube parser).
- DOS / PDOS / bands plot summaries from dos.x, projwfc.x, bands.x outputs (canonical extras + Sources Load summary toasts).
- Phonon modes and dispersion from ph.x / q2r.x.
- Bravais lattice + reciprocal lattice vectors.
- SCF convergence history.

Implementation:
- io/pipelines/src/formats/qe/mod.rs: handler + detection dispatch.
- qe/canonical/builder.rs: .out text-output canonical builder (geometry, energetics, task list).
- qe/input.rs: .in namelist + geometry-block parser. 13 unit tests including ibrav volume invariants (5/7/9/10/11/12/14) and the user’s Si / Fe / SrTiO₃ / graphene / benzene fixtures.
- qe/xml.rs: roxmltree-based data-file-schema.xml parser. 8 unit tests using real qm_tests/qe/ fixtures (gracefully skip when fixtures absent on CI).
- qe/xsf.rs: proper-XSF + Cube-as-xsf dispatcher. 5 unit tests.
- qe/dos.rs, pdos.rs, bands.rs: plot-data parsers (existing).
- Streaming: ✓ for SCF/relax outputs; .dat and XML loaded fully.

Sources manifest (viewer/core/src/ui_state/sources/qe.rs): 10 roles. Required: QePrimary (.out), QeInput (.in). Optional/Advanced: QeXml (.xml), QeUpf (.UPF). Plot data: QeDos (.dos / *_dos.dat), QePdosTotal (.pdos_tot), QePdosAtomic (.pdos_atm*), QeBands (.bands / .bands.dat), QeBandsGnu (.bands.gnu), QeXsf (.xsf). Sets enforce_glob_stem_match = false because QE post-processing files use unrelated stems (UPFs by element, XML by &CONTROL prefix).

Limitations:
- A dedicated QE Spectra panel that plots DOS / PDOS / band structure (mirroring the VASP equivalent) is future work; the parsers are in place but the rendering UI is not yet wired.
- .UPF pseudopotentials are surfaced informationally in Sources but not parsed.
- atomic_proj.xml (projwfc.x output) and the <prefix>.save/ binary checkpoint hierarchy are not parsed.
- ibrav -12 / ±13 (centred monoclinic variants) are treated as primitive monoclinic with a warning that the centring isn’t expanded.

Test coverage: SCF / relax fixtures for benzene, graphene, silicon, SrTiO₃, Fe, FeO; XML / XSF / .in real-fixture tests via qm_tests/qe/ paths (gracefully skip on CI).

Canonical Integration: extras.qe.scf_total_energy_ry, relax_profile, dos_summary, bands_summary, pdos_summary.

Adding New Formats

When extending Orbitron with a new format:

1. Create format module under io/pipelines/src/formats/<format>/mod.rs
2. Implement detection (detect_<format>(path, &[u8]) returning Option<bool>)
3. Write parsers (parse_<format>_scene, _trajectory, _frequency)
4. Canonical converter (follow patterns in §6.2)
5. Register handlers (formats/mod.rs::register_builtin_handlers)
6. Add fixtures (io/pipelines/tests/fixtures/<format>/)
7. Write tests (io/pipelines/tests/<format>_tests.rs)
8. Update CLI (automation/cli for format-specific commands)
9. Update documentation (this section + USER_GUIDE.md §2)

See §6.1 for the complete checklist and example code patterns.
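As a concrete shape for the detection step, the sketch below implements a detect_<format> helper for an invented format called "demo" whose files begin with the magic bytes #DEMO. The format, the magic string, and the reading of the return value (Some(true) for a positive match, Some(false) for a definite miss, None when there is nothing to sniff) are all assumptions for illustration; check the existing detectors for the convention the crate actually uses.

```rust
use std::path::Path;

/// Magic prefix of the invented "demo" format (hypothetical).
const DEMO_MAGIC: &[u8] = b"#DEMO";

/// Cheap content sniff that only inspects the first bytes of the file.
pub fn detect_demo(path: &Path, contents: &[u8]) -> Option<bool> {
    // Never look past a small prefix, so detection stays cheap
    // even when the caller hands over a large buffer.
    let head = &contents[..contents.len().min(256)];
    if head.starts_with(DEMO_MAGIC) {
        return Some(true);
    }
    let ext_matches = path
        .extension()
        .and_then(|e| e.to_str())
        .map_or(false, |e| e.eq_ignore_ascii_case("demo"));
    if ext_matches && head.is_empty() {
        // Right extension but no bytes to inspect: undecided.
        return None;
    }
    Some(false)
}
```

Content wins over the extension in this sketch: a file named run.out that starts with #DEMO is still accepted, while a .demo file with foreign content is rejected.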

6.6 Sources Subsystem (companion-file manifests)

The Sources subsystem renders the Analysis → Sources sub-tab and powers companion-file detection for multi-file formats. Coverage today: NBO7, VASP, NWChem, Gaussian, Molpro, Molcas, DIRAC, Quantum ESPRESSO. Every supported format declares a static manifest of “roles” — what each sibling file contributes — and the runtime walks the active scene’s directory matching siblings against role patterns.

Type layer (viewer/core/src/ui_state/sources/):
- manifest.rs: DetectedFormat enum, RoleId (one variant per role across all formats), CompanionRole (label + filename patterns + group + dependents), RoleGroup (Required / OptionalAdvanced / PlotData), RoleStatus (Loaded / Detected / Missing), and the filename_matches glob helper. Globs support four shapes: *X (ends-with, e.g. *.31), *X* (contains, e.g. *.pdos_atm*), X* (starts-with), and bare X (exact filename match like POSCAR).
- nbo.rs: NBO_MANIFEST (archive .47 plus .31–.46 plot files, .nbo analysis text).
- vasp.rs: VASP_MANIFEST with 14 roles (vasprun.xml, POSCAR, CONTCAR, OUTCAR, OSZICAR, INCAR, KPOINTS, POTCAR, DOSCAR, EIGENVAL, PROCAR, CHGCAR/CHG/PARCHG, XDATCAR, DYNMAT).
- nwchem.rs (7 roles): Output (*.out), Input (*.nw), *.movecs, *.hess, *.zmat, *.cube, *.civecs.
- gaussian.rs (5 roles): Output (*.log/*.out), Input (*.gjf/*.com), *.fchk, *.chk, *.cube.
- molpro.rs (6 roles): Output, Input (*.inp/*.com), *.xml, *.log, *.molden, *.cube.
- molcas.rs (10 roles): Output, Input, *.opt.xyz, the orbital family (*.ScfOrb, *.RasOrb, *.GssOrb, *.LprOrb, *.Mp2Orb), *.molden, status.
- dirac.rs (4 roles): Output, Input, *.mol, *.h5.
- qe.rs (10 roles): Output (*.out), Input (*.in), *.xml, *.UPF, *.dos/*_dos.dat, *.pdos_tot, *.pdos_atm*, *.bands/*.bands.dat, *.bands.gnu, *.xsf. Sets enforce_glob_stem_match = false because QE post-processing files use unrelated stems (UPFs by element, XML by &CONTROL prefix).
- mod.rs::manifest_for: dispatches a DetectedFormat to its manifest. New formats add one match arm here.
- state.rs: SourcesState storing the detected format, scan root, last-detected-path cache, and per-role status vector.
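The four glob shapes listed for filename_matches are simple enough to show in full. The sketch below reuses the helper's name but is written from the description above rather than from manifest.rs, so the argument order and the case-sensitive comparison are assumptions.

```rust
/// Match `name` against one of four pattern shapes:
/// `*X` (ends-with), `*X*` (contains), `X*` (starts-with), `X` (exact).
/// Comparison is case-sensitive in this sketch.
pub fn filename_matches(pattern: &str, name: &str) -> bool {
    match (pattern.starts_with('*'), pattern.ends_with('*')) {
        // `*X*`: the guard keeps a lone "*" out of this arm, because
        // slicing 1..len-1 needs at least two characters.
        (true, true) if pattern.len() >= 2 => {
            name.contains(&pattern[1..pattern.len() - 1])
        }
        // `*X` (and the degenerate "*", which matches everything).
        (true, _) => name.ends_with(&pattern[1..]),
        // `X*`
        (false, true) => name.starts_with(&pattern[..pattern.len() - 1]),
        // Bare `X`: exact filename.
        (false, false) => name == pattern,
    }
}
```

Only a leading or trailing star is interpreted; a star in the middle of a pattern is treated as a literal character, which keeps the matcher predictable for fixed role patterns such as *.31 or POSCAR.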

Per-manifest stem-match flag: FormatManifest.enforce_glob_stem_match controls whether glob patterns additionally require sibling stems to match (or extend with a dot suffix) the primary file’s stem. True for every format except QE — QE’s pseudopotentials are named by element, the structured XML uses the &CONTROL prefix keyword (often differs from the .in filename), and post-processing files inherit whatever output stem the user configured dos.x / projwfc.x / bands.x to write.

Multi-format .out sniffing: NWChem / Molpro / Molcas / DIRAC / QE / Gaussian all share the .out (and Gaussian also .log) extension. detect_format runs sniff_is_* in priority order — Molpro → Molcas → DIRAC → QE → NWChem → Gaussian — and caches the result on SourcesState.last_detected_path so the sniff doesn’t re-run every redraw. .in files route directly to QE without a content sniff (no other QC code in the dispatch list uses that extension); .xml files are content-sniffed for the QE namespace string to disambiguate from VASP’s vasprun.xml (which is caught earlier by path_is_vasp_primary).
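The priority-ordered sniff with a last-path cache can be sketched as a small table walk. The order of the table follows the text above. The sniff_is_* bodies are placeholders built from the marker strings quoted in the format sections, except for the QE marker ("Program PWSCF"), which is an assumption; SniffCache and detect_out_format are hypothetical names, not the SourcesState API.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DetectedFormat { Molpro, Molcas, Dirac, Qe, Nwchem, Gaussian }

// Placeholder sniffers: each inspects the head of the file as text.
fn sniff_is_molpro(head: &str) -> bool { head.contains("Molpro") }
fn sniff_is_molcas(head: &str) -> bool { head.contains("Molcas") }
fn sniff_is_dirac(head: &str) -> bool { head.contains("DIRAC") || head.contains("Dirac") }
fn sniff_is_qe(head: &str) -> bool { head.contains("Program PWSCF") } // assumed marker
fn sniff_is_nwchem(head: &str) -> bool {
    head.contains("Northwest Computational Chemistry Package") || head.contains("NWChem")
}
fn sniff_is_gaussian(head: &str) -> bool {
    head.contains("Gaussian") && head.contains("Copyright")
}

/// Sniffers in priority order; the first hit wins.
const SNIFF_ORDER: [(DetectedFormat, fn(&str) -> bool); 6] = [
    (DetectedFormat::Molpro, sniff_is_molpro),
    (DetectedFormat::Molcas, sniff_is_molcas),
    (DetectedFormat::Dirac, sniff_is_dirac),
    (DetectedFormat::Qe, sniff_is_qe),
    (DetectedFormat::Nwchem, sniff_is_nwchem),
    (DetectedFormat::Gaussian, sniff_is_gaussian),
];

pub fn detect_out_format(head: &str) -> Option<DetectedFormat> {
    SNIFF_ORDER
        .iter()
        .find(|entry| (entry.1)(head))
        .map(|entry| entry.0)
}

/// Remembers the verdict for the last path so repeated redraws
/// do not sniff the same file again.
#[derive(Default)]
pub struct SniffCache {
    last: Option<(PathBuf, Option<DetectedFormat>)>,
    pub sniff_runs: usize,
}

impl SniffCache {
    pub fn detect(&mut self, path: &Path, head: &str) -> Option<DetectedFormat> {
        if let Some((cached_path, verdict)) = &self.last {
            if cached_path.as_path() == path {
                return *verdict;
            }
        }
        self.sniff_runs += 1;
        let verdict = detect_out_format(head);
        self.last = Some((path.to_path_buf(), verdict));
        verdict
    }
}
```

Order matters because the checks are substring tests: an output that mentions another program by name (for example in a citation banner) is claimed by whichever sniffer comes first in the table.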

Runtime layer (ui/shell/src/sources_runtime.rs):
- refresh_sources(ui_state) runs at the top of every panel pass and is also called once after the menu_bar handler (which clobbers temp_ui_state.analysis to preserve File→Open menu actions). It is idempotent: it returns early when format and scan_root haven’t changed.
- detect_format checks the NBO workspace first (a .47 may be loaded into memory without being the active scene), then falls back to the active scene path: .47 extension → NBO7; canonical VASP filename → VASP.
- update_role_statuses walks the manifest, calls loaded_path_for_role to recognise files already in the workspace, and otherwise scans sibling files in the run directory.
- Mode-aware sibling matching in resolve_role_status: glob patterns (*.31) require stem-match against the primary file (so a fixtures-dir of u2oplot.* and uo2-test.* doesn’t cross-pollinate). Exact filenames (POSCAR) skip the stem constraint since each VASP file has a canonical name and can’t collide with another role. Stem-match also accepts extended stems: siblings whose stem starts with <primary_stem>. qualify, which handles Molcas’ *.scf.molden / *.rasscf.molden task-suffix convention without permitting cross-run matches (the dot anchor protects against acrolein2.scf.molden matching acrolein.out).
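The dot-anchored stem rule is easiest to see in code. The sketch below is an illustration of the rule as described, with hypothetical function names; it is not taken from resolve_role_status.

```rust
use std::path::Path;

/// True when `sibling` belongs to the same run as `primary`: its file name
/// is the primary stem, or the primary stem followed by a dot and anything.
pub fn stem_matches(primary: &Path, sibling: &Path) -> bool {
    let stem = match primary.file_stem().and_then(|s| s.to_str()) {
        Some(s) => s,
        None => return false,
    };
    let name = match sibling.file_name().and_then(|s| s.to_str()) {
        Some(s) => s,
        None => return false,
    };
    match name.strip_prefix(stem) {
        // The dot anchor: "acrolein.scf.molden" passes for stem "acrolein",
        // "acrolein2.scf.molden" does not (the remainder starts with '2').
        Some(rest) => rest.is_empty() || rest.starts_with('.'),
        None => false,
    }
}

/// Glob patterns are stem-checked only when the manifest asks for it;
/// exact filenames such as POSCAR never are.
pub fn needs_stem_check(pattern: &str, enforce_glob_stem_match: bool) -> bool {
    enforce_glob_stem_match && pattern.contains('*')
}
```

Comparing against the full sibling file name, rather than its stem, is what lets multi-suffix names such as run.rasscf.molden qualify without any special casing.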

Panel + actions:
- ui/shell/src/panels/analysis/sources.rs renders one row per role with a status icon, the role’s label/description tooltip, and one of three button affordances (Reload / Load / Add file…). VASP POSCAR/CONTCAR rows also expose a Compare button that emits SourcesAction::OverlayCompanion.
- ui/shell/src/viewer_loop/runtime/redraw/panels/mod.rs::handle_sources_action dispatches role loads:
  - NBO roles stage into NboWorkspace via load_archive_contents / add_supporting_file_contents.
  - VASP roles (most) re-run begin_loading_path so the canonical pipeline auto-collects siblings; analysis state is mirrored back to temp_ui_state to survive the end-of-frame sync clobber.
  - CHGCAR/CHG/PARCHG register as orbital datasets and switch to the Surfaces tab.
  - XDATCAR loads via the trajectory dispatcher (see formats/loaders/trajectory.rs::load_trajectory).
  - OverlayCompanion pushes the file as a scene overlay through register_overlay_from_path (no primary-scene swap).

Adding Sources support for a new format:
1. Add the RoleId variants to viewer/core/src/ui_state/sources/manifest.rs.
2. Create viewer/core/src/ui_state/sources/<format>.rs with the manifest constant.
3. Wire manifest_for in mod.rs.
4. Extend sources_runtime::detect_format and loaded_path_for_role to recognise the format’s primary file and any in-workspace state.
5. Branch handle_sources_action for any role-specific load behavior; default is begin_loading_path (swap primary scene). Volumetric / trajectory roles need their own dispatch.
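To make steps 1 and 2 concrete, the sketch below declares a manifest constant for an invented "demo" format and derives an initial status per role from a list of sibling file names. The type names mirror those mentioned in the type layer, but the field layout, the Demo* role variants, the file extensions, and initial_statuses are assumptions; the local definitions are stand-ins so the example compiles on its own.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RoleGroup { Required, OptionalAdvanced, PlotData }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RoleStatus { Loaded, Detected, Missing }

/// Step 1: one variant per role (hypothetical roles for the demo format).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RoleId { DemoOutput, DemoInput, DemoGrid }

pub struct CompanionRole {
    pub id: RoleId,
    pub label: &'static str,
    pub patterns: &'static [&'static str],
    pub group: RoleGroup,
}

pub struct FormatManifest {
    pub roles: &'static [CompanionRole],
    pub enforce_glob_stem_match: bool,
}

/// Step 2: the static manifest constant for the new format.
pub const DEMO_MANIFEST: FormatManifest = FormatManifest {
    roles: &[
        CompanionRole { id: RoleId::DemoOutput, label: "Output", patterns: &["*.dout"], group: RoleGroup::Required },
        CompanionRole { id: RoleId::DemoInput, label: "Input", patterns: &["*.din"], group: RoleGroup::Required },
        CompanionRole { id: RoleId::DemoGrid, label: "Density grid", patterns: &["*.dgrid"], group: RoleGroup::PlotData },
    ],
    enforce_glob_stem_match: true,
};

/// Simplified matcher for this sketch: `*X` is ends-with, anything else is exact.
fn matches_simple(pattern: &str, name: &str) -> bool {
    match pattern.strip_prefix('*') {
        Some(suffix) => name.ends_with(suffix),
        None => name == pattern,
    }
}

/// Mark each role Detected when any sibling matches one of its patterns.
/// Loaded would be set separately, from workspace state.
pub fn initial_statuses(manifest: &FormatManifest, siblings: &[&str]) -> Vec<(RoleId, RoleStatus)> {
    manifest
        .roles
        .iter()
        .map(|role| {
            let found = siblings
                .iter()
                .any(|name| role.patterns.iter().any(|p| matches_simple(p, name)));
            (role.id, if found { RoleStatus::Detected } else { RoleStatus::Missing })
        })
        .collect()
}
```

Because the manifest is plain static data, the remaining steps (the manifest_for match arm, detect_format, and any role-specific load branches) only need to refer to the constant and the new RoleId variants.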

6.4 Orbitron Scene JSON Format

This subsection describes the Orbitron Scene JSON format — a stable, human-readable representation of a molecular scene.

The format is produced by SceneGraph::to_json() in Rust and Scene.to_json() in the Python bridge. It is consumed by SceneGraph::from_json(), Orbitron.load_json(), and the WASM set_scene_json() function.

Envelope

Every scene JSON file that Orbitron writes starts with the envelope fields:

{
  "format": "orbitron-scene",
  "version": "1.0",
  ...scene fields...
}

Field	Type	Value	Required
`format`	string	`"orbitron-scene"`	Yes
`version`	string	`"1.0"`	Yes

The envelope is optional for input — from_json() also accepts bare snapshot JSON (without format/version). It is always present in output.

Scene Fields

The scene fields are flattened into the top-level object alongside the envelope fields.

`atoms` (required)

Array of atom records.

"atoms": [
  {
    "id": 0,
    "position": [1.2, 0.0, -0.5],
    "atomic_number": 6,
    "mass_number": null,
    "formal_charge": 0,
    "properties": {}
  }
]

Field	Type	Description
`id`	integer	Unique atom identifier (u64)
`position`	[f32, f32, f32]	Cartesian coordinates in Ångström
`atomic_number`	integer (1–118)	Element atomic number (H=1, C=6, …)
`mass_number`	integer or null	Isotopic mass (null = natural abundance)
`formal_charge`	integer (i8)	Formal charge (typically −4 to +4)
`properties`	object	Arbitrary key–value properties (see Properties)

`explicit_bonds` (optional)

Array of explicit bond records. If omitted, bonds are inferred from interatomic distances by the viewer.

"explicit_bonds": [
  {
    "id": 0,
    "atoms": [0, 1],
    "order": "Single",
    "properties": {}
  }
]

Field	Type	Description
`id`	integer	Unique bond identifier (u64)
`atoms`	[u64, u64]	IDs of the two bonded atoms (ordered)
`order`	string or null	`"Single"`, `"Double"`, `"Triple"`, `"Aromatic"`, `"Unspecified"`, or `{"Other": N}`
`properties`	object	Arbitrary key–value properties

`metadata` (optional)

"metadata": {
  "name": "Caffeine",
  "source": "PubChem CID 2519",
  "tags": {
    "gaussian:last_energy_hartree": "-679.514",
    "gaussian:optimization_converged": "true"
  }
}

Field	Type	Description
`name`	string or null	Human-readable molecule name
`source`	string or null	Provenance / file path / URL
`tags`	object (string→string)	Arbitrary metadata key–value pairs

`unit_cell` (optional)

Present for periodic systems (crystals, slabs, etc.).

"unit_cell": {
  "a": [5.43, 0.0,  0.0],
  "b": [0.0,  5.43, 0.0],
  "c": [0.0,  0.0,  5.43],
  "periodic": [true, true, true]
}

Field	Type	Description
`a`	[f32, f32, f32]	Lattice vector a in Ångström
`b`	[f32, f32, f32]	Lattice vector b in Ångström
`c`	[f32, f32, f32]	Lattice vector c in Ångström
`periodic`	[bool, bool, bool]	Which axes are periodic [a, b, c]

`digest` (optional)

A BLAKE3 content hash, produced automatically by to_json(). It is not checked on input, but it is preserved so that it can be used for cache invalidation.

"digest": {
  "hash": "3a8f..."
}

Properties

Both atoms and bonds carry an optional properties object. Keys are strings; values follow the Orbitron Value enum, which serialises as:

Rust variant	JSON representation
`Float(f64)`	`{"Float": 1.23}`
`Int(i64)`	`{"Int": -5}`
`Bool(bool)`	`{"Bool": true}`
`Text(string)`	`{"Text": "hello"}`
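The table above is the externally tagged form: a one-key object whose key is the variant name. The sketch below writes that form by hand with the standard library only, to show what the output looks like. The local Value enum is a stand-in that mirrors the four variants listed; how Orbitron actually derives its serialisation is not shown in this section, and the hand-rolled escaping here is an assumption for illustration.

```rust
/// Stand-in for the property value enum, mirroring the four variants above.
pub enum Value {
    Float(f64),
    Int(i64),
    Bool(bool),
    Text(String),
}

impl Value {
    /// Render the value as a one-key JSON object tagged with the variant name.
    /// Note: NaN and infinite floats have no JSON representation and would
    /// produce invalid output here; callers must filter them out first.
    pub fn to_json(&self) -> String {
        match self {
            Value::Float(x) => format!("{{\"Float\": {}}}", x),
            Value::Int(n) => format!("{{\"Int\": {}}}", n),
            Value::Bool(b) => format!("{{\"Bool\": {}}}", b),
            Value::Text(s) => format!("{{\"Text\": \"{}\"}}", escape_json(s)),
        }
    }
}

/// Escape the characters that JSON strings do not allow verbatim.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

The tag is what lets a reader tell Float from Int even when the number itself has no fractional part, so a value written as {"Float": 1} still round-trips to the Float variant.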

Complete Minimal Example

{
  "format": "orbitron-scene",
  "version": "1.0",
  "atoms": [
    {"id": 0, "position": [0.0, 0.0, 0.0], "atomic_number": 6,
     "mass_number": null, "formal_charge": 0, "properties": {}},
    {"id": 1, "position": [1.54, 0.0, 0.0], "atomic_number": 6,
     "mass_number": null, "formal_charge": 0, "properties": {}}
  ],
  "explicit_bonds": [
    {"id": 0, "atoms": [0, 1], "order": "Single", "properties": {}}
  ],
  "metadata": {"name": "Ethane", "source": null, "tags": {}}
}

Loading in Different Contexts

Rust:

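use orbitron_backbone::SceneGraph;

// `json_str` is a &str holding scene JSON, with or without the envelope.
// Both calls are fallible; `?` assumes an enclosing function that can propagate the error.
let scene = SceneGraph::from_json(json_str)?;
let json = scene.to_json()?;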

WASM (JavaScript / TypeScript):

// URL auto-detected by extension
await viewer.loadScene("./caffeine.json");

// Or inline JSON
await viewer.loadSceneJSON(jsonString);

// Or directly via WASM binding
import { set_scene_json } from "./orbitron-viewer-wasm.js";
set_scene_json(jsonString);

Python:

from orbitron import Orbitron

orb = Orbitron()
scene = orb.load("caffeine.xyz")

# Export to JSON
json_str = scene.to_json()

# Reload from JSON
scene2 = orb.load_json(json_str)

# Inline Jupyter display (calls to_json() internally)
scene  # displays interactive 3D viewer in notebook

Stability

- The format and version fields are reserved for future backward-compat negotiation.
- The 1.0 schema is stable: adding new optional fields is allowed; removing or renaming fields requires a version bump.
- The binary .bin format (bincode) is not stable: struct changes break it. Use JSON for long-lived files or cross-language exchange.
