Document normalize command

2026-05-09 12:35:48 +00:00
parent 5b008e272c
commit 3591041fa8
2 changed files with 93 additions and 4 deletions
--- a/architecture.md
+++ b/architecture.md
@@ -3,7 +3,8 @@
 `seriatim` is a deterministic transcript utility for:

 - merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
-- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming.
+- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming, and
+- canonicalizing external transcript-style JSON inputs into standard seriatim output schemas.

 The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.

@@ -60,7 +61,7 @@ configuration check

 Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.

-`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
+`merge` runs this pipeline. `trim` and `normalize` are intentionally separate from this pipeline and operate at the artifact layer.

 ## Stage Contracts

@@ -214,6 +215,24 @@ Design constraints:

 `trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.

+### 8. Artifact Canonicalization Stage (`normalize` command)
+
+`normalize` is an artifact-level command that reads transcript-like JSON and emits a standard seriatim output artifact in a selected schema.
+
+Design constraints:
+
+- `normalize` runs outside the merge pipeline and does not invoke merge preprocessing or postprocessing modules.
+- `normalize` accepts two input shapes: object-with-`segments` and bare segment arrays.
+- `normalize` validates required segment fields (`start`, `end`, `speaker`, `text`) and timing/speaker constraints.
+- `normalize` sorts segments deterministically by chronological keys and stable input-index tie-breakers.
+- `normalize` assigns fresh sequential output IDs (`1..N`) after sorting.
+- `normalize` validates final output against the selected schema before writing.
+- `normalize` writes optional deterministic report diagnostics when `--report-file` is requested.
+
+`normalize` is intended for canonicalizing external transcript outputs (including Audita-style bare arrays) into seriatim contracts, not for running merge-time language or overlap transformations.
+
+`normalize` must not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
+
 ## Module Classification

 Modules should be classified by their contract and allowed effects.
@@ -442,6 +461,13 @@ Trim-specific determinism requirements:
 - Old-to-new ID mapping in trim reports is emitted in deterministic order.
 - Full-schema overlap recomputation is deterministic for the same input artifact and selector.

+Normalize-specific determinism requirements:
+
+- Input-shape detection is deterministic.
+- Segment ordering is deterministic for identical input data.
+- Output IDs are always reassigned sequentially after deterministic sorting.
+- Normalize diagnostic reports are deterministic for identical inputs and configuration.
+
 ## Go Package Layout

 ```text
@@ -450,6 +476,7 @@ internal/config/         CLI/env/config loading and validation
 internal/pipeline/       Pipeline orchestration and module registry
 internal/builtin/        Built-in pipeline modules
 internal/artifact/       Conversion from internal model to public output schema
+internal/normalize/      Normalize input parsing, validation, deterministic sorting, schema conversion, and diagnostics
 internal/trim/           Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
 internal/buildinfo/      Build-time version metadata
 internal/speaker/        Speaker map parsing and lookup
@@ -468,6 +495,12 @@ For trim:
 - CLI command code handles only flag parsing, file I/O, and report emission.
 - Transform logic is deterministic and pure except for command-layer I/O.

+For normalize:
+
+- `internal/normalize` contains parsing/validation and deterministic schema conversion logic.
+- CLI command code handles flag parsing and delegates execution.
+- Normalize remains artifact-level and does not compose merge pipeline modules.
+
 ## Default Modules

 The default pipeline is equivalent to explicit module lists.
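
Reviewer's sketch, not part of the diff: the `normalize` constraints added in this commit (two input shapes, validation, deterministic sorting, fresh `1..N` IDs) might look roughly like the following in Go. Everything here is an assumption made for illustration: the `Segment` type, the `parseSegments` and `normalize` function names, the use of float seconds for timing, and the specific validation rules are hypothetical and are not taken from `internal/normalize`. Schema validation and report emission are omitted.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sort"
)

// Segment is a hypothetical model. Only start, end, speaker, and text are
// named as required fields in the design constraints; the ID field and the
// float-seconds timing representation are assumptions.
type Segment struct {
	ID      int     `json:"id"`
	Start   float64 `json:"start"`
	End     float64 `json:"end"`
	Speaker string  `json:"speaker"`
	Text    string  `json:"text"`
}

// parseSegments accepts either an object with a "segments" array or a bare
// segment array, trying the object shape first so detection is deterministic.
func parseSegments(data []byte) ([]Segment, error) {
	var wrapped struct {
		Segments *[]Segment `json:"segments"`
	}
	if err := json.Unmarshal(data, &wrapped); err == nil && wrapped.Segments != nil {
		return *wrapped.Segments, nil
	}
	var bare []Segment
	if err := json.Unmarshal(data, &bare); err != nil {
		return nil, errors.New("input is neither an object with segments nor a segment array")
	}
	return bare, nil
}

// normalize validates segments, sorts them by (start, end) with input order
// as the tie-breaker, and assigns fresh sequential IDs starting at 1.
// The validation rules shown are illustrative guesses.
func normalize(in []Segment) ([]Segment, error) {
	for i, s := range in {
		if s.Speaker == "" {
			return nil, fmt.Errorf("segment %d: missing speaker", i)
		}
		if s.Start < 0 || s.End < s.Start {
			return nil, fmt.Errorf("segment %d: invalid timing", i)
		}
	}
	out := make([]Segment, len(in))
	copy(out, in) // leave the caller's slice untouched
	// SliceStable preserves input order for equal keys, which acts as the
	// stable input-index tie-breaker.
	sort.SliceStable(out, func(a, b int) bool {
		if out[a].Start != out[b].Start {
			return out[a].Start < out[b].Start
		}
		return out[a].End < out[b].End
	})
	for i := range out {
		out[i].ID = i + 1 // discard any input IDs
	}
	return out, nil
}

func main() {
	input := []byte(`[{"start":2,"end":3,"speaker":"B","text":"hi"},
	                  {"start":0,"end":1,"speaker":"A","text":"hello"}]`)
	segs, err := parseSegments(input)
	if err != nil {
		panic(err)
	}
	out, err := normalize(segs)
	if err != nil {
		panic(err)
	}
	for _, s := range out {
		fmt.Printf("%d %s %.1f-%.1f %s\n", s.ID, s.Speaker, s.Start, s.End, s.Text)
	}
}
```

Using a stable sort over a copy of the input keeps the tie-breaking rule implicit in input order, which is one simple way to satisfy the determinism requirements without carrying an explicit index field.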