From 3591041fa8dfdf1d4039d5df95d0a73f05f0f3df Mon Sep 17 00:00:00 2001
From: Eric Rakestraw
Date: Sat, 9 May 2026 12:35:48 +0000
Subject: [PATCH] Document normalize command

---
 README.md       | 60 +++++++++++++++++++++++++++++++++++++++++++++++--
 architecture.md | 37 ++++++++++++++++++++++++++++--
 2 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index fad9ad4..d260c17 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,8 @@
 # seriatim
 
-`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID and normalizes external transcript-like JSON into standard seriatim output schemas.
 
-The current implementation supports the `merge` and `trim` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset.
+The current implementation supports the `merge`, `trim`, and `normalize` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset. `normalize` reads transcript-like JSON input, validates required segment fields, sorts deterministically, assigns fresh IDs, and emits a selected seriatim output schema.
 
 ## Usage
 
@@ -34,11 +34,30 @@ go run ./cmd/seriatim trim \
   --keep "1-10, 15, 20-25"
 ```
 
+Normalize external transcript-style JSON:
+
+```sh
+go run ./cmd/seriatim normalize \
+  --input-file transcript.json \
+  --output-file normalized.json
+```
+
+Normalize an Audita-style bare segment array to full schema with report output:
+
+```sh
+go run ./cmd/seriatim normalize \
+  --input-file audita-segments.json \
+  --output-file normalized-full.json \
+  --output-schema seriatim-full \
+  --report-file normalize-report.json
+```
+
 ## CLI
 
 ```text
 seriatim merge [flags]
 seriatim trim [flags]
+seriatim normalize [flags]
 ```
 
 Global flags:
@@ -108,6 +127,43 @@ Global flags:
 - The report includes a `trim-audit` event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
 - Old-to-new ID mapping is emitted as a deterministic ordered array of `{old_id, new_id}` pairs.
 
+`normalize` flags:
+
+| Flag | Required | Default | Description |
+| --- | --- | --- | --- |
+| `--input-file` | Yes | none | Input transcript JSON file. |
+| `--output-file` | Yes | none | Normalized transcript JSON output path. |
+| `--output-schema` | No | `seriatim-intermediate` (resolved via `SERIATIM_OUTPUT_SCHEMA` when set) | Output JSON schema: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`. |
+| `--output-modules` | No | `json` | Comma-separated output modules. Current normalize support is `json` only. |
+| `--report-file` | No | none | Optional report JSON output path. |
+
+`normalize` input shapes:
+
+- Top-level object with a `segments` array.
+- Bare top-level array of segment objects (for example, Audita-style output).
+
+`normalize` required segment fields:
+
+- `start`
+- `end`
+- `speaker`
+- `text`
+
+`normalize` behavior:
+
+- Validates `start >= 0`, `end >= start`, and non-empty `speaker`.
+- Accepts existing input `id` values as provenance only.
+- Reassigns output segment IDs sequentially from `1` to `N`.
+- Sorts deterministically by `(start, end, original_input_index, speaker)`.
+- Uses original input order only as a tie-breaker.
+- Does not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
+- Useful for converting external transcript outputs into standard seriatim artifacts.
+
+`normalize` report output:
+
+- When `--report-file` is provided, normalize emits deterministic report events with input shape detection, segment counts, schema/module selections, sorting/ID diagnostics, and output write/validation summaries.
+- A machine-readable `normalize-audit` event is included for downstream tooling.
+
 Environment variables:
 
 | Environment Variable | Default | Description |
diff --git a/architecture.md b/architecture.md
index 9cff91a..7f36866 100644
--- a/architecture.md
+++ b/architecture.md
@@ -3,7 +3,8 @@
 `seriatim` is a deterministic transcript utility for:
 
 - merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
-- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming.
+- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming, and
+- canonicalizing external transcript-style JSON inputs into standard seriatim output schemas.
 
 The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.
 
@@ -60,7 +61,7 @@ configuration check
 
 Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.
 
-`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
+`merge` runs this pipeline. `trim` and `normalize` are intentionally separate from this pipeline and operate at the artifact layer.
 
 ## Stage Contracts
 
@@ -214,6 +215,24 @@ Design constraints:
 
 `trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.
 
+### 8. Artifact Canonicalization Stage (`normalize` command)
+
+`normalize` is an artifact-level command that reads transcript-like JSON and emits a standard seriatim output artifact in a selected schema.
+
+Design constraints:
+
+- `normalize` runs outside the merge pipeline and does not invoke merge preprocessing or postprocessing modules.
+- `normalize` accepts two input shapes: object-with-`segments` and bare segment arrays.
+- `normalize` validates required segment fields (`start`, `end`, `speaker`, `text`) and timing/speaker constraints.
+- `normalize` sorts segments deterministically by chronological keys and stable input-index tie-breakers.
+- `normalize` assigns fresh sequential output IDs (`1..N`) after sorting.
+- `normalize` validates final output against the selected schema before writing.
+- `normalize` writes optional deterministic report diagnostics when `--report-file` is requested.
+
+`normalize` is intended for canonicalizing external transcript outputs (including Audita-style bare arrays) into seriatim contracts, not for running merge-time language or overlap transformations.
+
+`normalize` must not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
+
 ## Module Classification
 
 Modules should be classified by their contract and allowed effects.
@@ -442,6 +461,13 @@ Trim-specific determinism requirements:
 - Old-to-new ID mapping in trim reports is emitted in deterministic order.
 - Full-schema overlap recomputation is deterministic for the same input artifact and selector.
 
+Normalize-specific determinism requirements:
+
+- Input-shape detection is deterministic.
+- Segment ordering is deterministic for identical input data.
+- Output IDs are always reassigned sequentially after deterministic sorting.
+- Normalize diagnostic reports are deterministic for identical inputs and configuration.
+
 ## Go Package Layout
 
 ```text
@@ -450,6 +476,7 @@ internal/config/ CLI/env/config loading and validation
 internal/pipeline/ Pipeline orchestration and module registry
 internal/builtin/ Built-in pipeline modules
 internal/artifact/ Conversion from internal model to public output schema
+internal/normalize/ Normalize input parsing, validation, deterministic sorting, schema conversion, and diagnostics
 internal/trim/ Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
 internal/buildinfo/ Build-time version metadata
 internal/speaker/ Speaker map parsing and lookup
@@ -468,6 +495,12 @@ For trim:
 - CLI command code handles only flag parsing, file I/O, and report emission.
 - Transform logic is deterministic and pure except for command-layer I/O.
 
+For normalize:
+
+- `internal/normalize` contains parsing/validation and deterministic schema conversion logic.
+- CLI command code handles flag parsing and delegates execution.
+- Normalize remains artifact-level and does not compose merge pipeline modules.
+
 ## Default Modules
 
 The default pipeline is equivalent to explicit module lists.
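
The validation, ordering, and ID rules this patch documents for `normalize` can be illustrated with a minimal Go sketch. It is not taken from `internal/normalize`: the `Segment` type and `Normalize` function are hypothetical names chosen for illustration, and the sketch assumes segments have already been decoded from JSON, so it cannot check whether `text` was present in the input.

```go
package main

import (
	"fmt"
	"sort"
)

// Segment is a hypothetical, simplified segment shape used only for this
// sketch; the real seriatim schemas may carry more fields.
type Segment struct {
	ID      int
	Start   float64
	End     float64
	Speaker string
	Text    string
}

// Normalize validates segments, sorts them by
// (start, end, original input index, speaker), and assigns fresh IDs 1..N.
// Input IDs are ignored for ordering and are overwritten in the output.
func Normalize(in []Segment) ([]Segment, error) {
	type indexed struct {
		seg Segment
		idx int // original input position, used as a tie-breaker
	}
	items := make([]indexed, len(in))
	for i, s := range in {
		if s.Start < 0 || s.End < s.Start {
			return nil, fmt.Errorf("segment %d: invalid timing [%v, %v]", i, s.Start, s.End)
		}
		if s.Speaker == "" {
			return nil, fmt.Errorf("segment %d: empty speaker", i)
		}
		items[i] = indexed{seg: s, idx: i}
	}
	sort.SliceStable(items, func(a, b int) bool {
		x, y := items[a], items[b]
		if x.seg.Start != y.seg.Start {
			return x.seg.Start < y.seg.Start
		}
		if x.seg.End != y.seg.End {
			return x.seg.End < y.seg.End
		}
		if x.idx != y.idx {
			return x.idx < y.idx
		}
		// Input indexes are unique, so this key is never reached here;
		// it is kept only to mirror the documented key order.
		return x.seg.Speaker < y.seg.Speaker
	})
	out := make([]Segment, len(items))
	for i, it := range items {
		out[i] = it.seg
		out[i].ID = i + 1 // fresh sequential IDs after sorting
	}
	return out, nil
}

func main() {
	in := []Segment{
		{ID: 7, Start: 2.0, End: 3.0, Speaker: "alice", Text: "later"},
		{ID: 9, Start: 0.5, End: 1.0, Speaker: "bob", Text: "first"},
		{ID: 3, Start: 0.5, End: 1.0, Speaker: "alice", Text: "tie"},
	}
	out, err := Normalize(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range out {
		fmt.Printf("%d %.1f-%.1f %s: %s\n", s.ID, s.Start, s.End, s.Speaker, s.Text)
	}
}
```

Because the original input index is unique per segment, it fully resolves ties between segments with equal `start` and `end`, which is what makes the resulting order reproducible for identical input.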