Document normalize command

2026-05-09 12:35:48 +00:00
parent 5b008e272c
commit 3591041fa8
2 changed files with 93 additions and 4 deletions
--- a/README.md
+++ b/README.md
@@ -1,8 +1,8 @@
 # seriatim

-`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID and normalizes external transcript-like JSON into standard seriatim output schemas.

-The current implementation supports the `merge` and `trim` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset.
+The current implementation supports the `merge`, `trim`, and `normalize` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset. `normalize` reads transcript-like JSON input, validates required segment fields, sorts deterministically, assigns fresh IDs, and emits a selected seriatim output schema.

 ## Usage

@@ -34,11 +34,30 @@ go run ./cmd/seriatim trim \
  --keep "1-10, 15, 20-25"
 ```

+Normalize external transcript-style JSON:
+
+```sh
+go run ./cmd/seriatim normalize \
+  --input-file transcript.json \
+  --output-file normalized.json
+```
+
+Normalize an Audita-style bare segment array to the full schema and write a report:
+
+```sh
+go run ./cmd/seriatim normalize \
+  --input-file audita-segments.json \
+  --output-file normalized-full.json \
+  --output-schema seriatim-full \
+  --report-file normalize-report.json
+```
+
 ## CLI

 ```text
 seriatim merge [flags]
 seriatim trim [flags]
+seriatim normalize [flags]
 ```

 Global flags:
@@ -108,6 +127,43 @@ Global flags:
 - The report includes a `trim-audit` event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
 - Old-to-new ID mapping is emitted as a deterministic ordered array of `{old_id, new_id}` pairs.

+`normalize` flags:
+
+| Flag | Required | Default | Description |
+| --- | --- | --- | --- |
+| `--input-file` | Yes | none | Input transcript JSON file. |
+| `--output-file` | Yes | none | Normalized transcript JSON output path. |
+| `--output-schema` | No | `seriatim-intermediate` (resolved via `SERIATIM_OUTPUT_SCHEMA` when set) | Output JSON schema: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`. |
+| `--output-modules` | No | `json` | Comma-separated output modules. `normalize` currently supports `json` only. |
+| `--report-file` | No | none | Optional report JSON output path. |
+
+`normalize` input shapes:
+
+- Top-level object with a `segments` array.
+- Bare top-level array of segment objects (for example, Audita-style output).
+
+`normalize` required segment fields:
+
+- `start`
+- `end`
+- `speaker`
+- `text`
+
+`normalize` behavior:
+
+- Validates `start >= 0`, `end >= start`, and non-empty `speaker`.
+- Accepts existing input `id` values as provenance only.
+- Reassigns output segment IDs sequentially from `1` to `N`.
+- Sorts deterministically by `(start, end, original_input_index, speaker)`.
+- Uses original input order only as a tie-breaker.
+- Does not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
+- Useful for converting external transcript outputs into standard seriatim artifacts.
+
+`normalize` report output:
+
+- When `--report-file` is provided, `normalize` emits deterministic report events covering input shape detection, segment counts, schema and module selections, sorting and ID diagnostics, and output write and validation summaries.
+- A machine-readable `normalize-audit` event is included for downstream tooling.
+
 Environment variables:

 | Environment Variable | Default | Description |