Document normalize command
@@ -3,7 +3,8 @@
`seriatim` is a deterministic transcript utility for:

- merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming.
- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming, and
- canonicalizing external transcript-style JSON inputs into standard seriatim output schemas.

The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.

@@ -60,7 +61,7 @@ configuration check

Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.

`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
`merge` runs this pipeline. `trim` and `normalize` are intentionally separate from this pipeline and operate at the artifact layer.

## Stage Contracts

@@ -214,6 +215,24 @@ Design constraints:

`trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.

### 8. Artifact Canonicalization Stage (`normalize` command)

`normalize` is an artifact-level command that reads transcript-like JSON and emits a standard seriatim output artifact in a selected schema.

Design constraints:

- `normalize` runs outside the merge pipeline and does not invoke merge preprocessing or postprocessing modules.
- `normalize` accepts two input shapes: object-with-`segments` and bare segment arrays.
- `normalize` validates required segment fields (`start`, `end`, `speaker`, `text`) and timing/speaker constraints.
- `normalize` sorts segments deterministically by chronological keys and stable input-index tie-breakers.
- `normalize` assigns fresh sequential output IDs (`1..N`) after sorting.
- `normalize` validates final output against the selected schema before writing.
- `normalize` writes optional deterministic report diagnostics when `--report-file` is requested.

`normalize` is intended for canonicalizing external transcript outputs (including Audita-style bare arrays) into seriatim contracts, not for running merge-time language or overlap transformations.

`normalize` must not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.

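The validate, sort, and renumber constraints for `normalize` can be illustrated with a minimal Go sketch. Everything named here (`Segment`, `validate`, `Normalize`) is hypothetical and not taken from the seriatim codebase. The concrete validation rules (non-empty speaker, non-negative start, `end >= start`) and the sort keys (start, then end) are assumptions about what "timing/speaker constraints" and "chronological keys" might mean.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Segment is a hypothetical in-memory shape for one transcript segment.
type Segment struct {
	ID      int
	Start   float64
	End     float64
	Speaker string
	Text    string
}

// validate applies assumed constraints: a non-empty speaker,
// a non-negative start, and an end that is not before the start.
func validate(s Segment) error {
	if s.Speaker == "" {
		return errors.New("speaker is required")
	}
	if s.Start < 0 || s.End < s.Start {
		return fmt.Errorf("invalid timing: start=%v end=%v", s.Start, s.End)
	}
	return nil
}

// Normalize validates every segment, sorts a copy chronologically,
// and assigns fresh sequential IDs 1..N. The input slice is not modified.
func Normalize(in []Segment) ([]Segment, error) {
	out := make([]Segment, len(in))
	copy(out, in)
	for i, s := range out {
		if err := validate(s); err != nil {
			return nil, fmt.Errorf("segment %d: %w", i, err)
		}
	}
	// A stable sort keeps equal-keyed segments in input order, which
	// acts as the input-index tie-breaker.
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Start != out[j].Start {
			return out[i].Start < out[j].Start
		}
		return out[i].End < out[j].End
	})
	for i := range out {
		out[i].ID = i + 1
	}
	return out, nil
}
```

Because the sort is stable and IDs are assigned only after sorting, the same input slice should always produce the same output order and the same IDs. Any IDs present in the input are discarded.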
## Module Classification

Modules should be classified by their contract and allowed effects.

@@ -442,6 +461,13 @@ Trim-specific determinism requirements:
- Old-to-new ID mapping in trim reports is emitted in deterministic order.
- Full-schema overlap recomputation is deterministic for the same input artifact and selector.

Normalize-specific determinism requirements:

- Input-shape detection is deterministic.
- Segment ordering is deterministic for identical input data.
- Output IDs are always reassigned sequentially after deterministic sorting.
- Normalize diagnostic reports are deterministic for identical inputs and configuration.

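One way to make input-shape detection deterministic is to decide purely from the first non-whitespace byte of the JSON document, then decode into the matching Go type. The sketch below assumes that approach. The names `RawSegment`, `Shape`, `DetectShape`, and `ParseSegments` are hypothetical and do not come from `internal/normalize`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
)

// RawSegment mirrors the required segment fields. Pointer fields let a
// later validation step tell a missing field from a zero value.
type RawSegment struct {
	Start   *float64 `json:"start"`
	End     *float64 `json:"end"`
	Speaker *string  `json:"speaker"`
	Text    *string  `json:"text"`
}

// Shape names the two accepted input shapes.
type Shape string

const (
	ShapeObject Shape = "object-with-segments"
	ShapeArray  Shape = "bare-array"
)

// DetectShape looks only at the first non-whitespace byte, so identical
// input bytes always yield the same shape.
func DetectShape(data []byte) (Shape, error) {
	trimmed := bytes.TrimLeft(data, " \t\r\n")
	if len(trimmed) == 0 {
		return "", errors.New("empty input")
	}
	switch trimmed[0] {
	case '{':
		return ShapeObject, nil
	case '[':
		return ShapeArray, nil
	default:
		return "", errors.New("input must be a JSON object or array")
	}
}

// ParseSegments decodes either shape into a slice in input order.
func ParseSegments(data []byte) ([]RawSegment, Shape, error) {
	shape, err := DetectShape(data)
	if err != nil {
		return nil, "", err
	}
	if shape == ShapeArray {
		var segs []RawSegment
		if err := json.Unmarshal(data, &segs); err != nil {
			return nil, "", err
		}
		return segs, shape, nil
	}
	var doc struct {
		Segments *[]RawSegment `json:"segments"`
	}
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, "", err
	}
	if doc.Segments == nil {
		return nil, "", errors.New(`object input is missing "segments"`)
	}
	return *doc.Segments, shape, nil
}
```

Both shapes produce a slice in input order, so a later stable sort can use slice position as its tie-breaker regardless of which shape was supplied.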
## Go Package Layout

```text
@@ -450,6 +476,7 @@ internal/config/ CLI/env/config loading and validation
internal/pipeline/   Pipeline orchestration and module registry
internal/builtin/    Built-in pipeline modules
internal/artifact/   Conversion from internal model to public output schema
internal/normalize/  Normalize input parsing, validation, deterministic sorting, schema conversion, and diagnostics
internal/trim/       Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
internal/buildinfo/  Build-time version metadata
internal/speaker/    Speaker map parsing and lookup
@@ -468,6 +495,12 @@ For trim:
- CLI command code handles only flag parsing, file I/O, and report emission.
- Transform logic is deterministic and pure except for command-layer I/O.

For normalize:

- `internal/normalize` contains parsing/validation and deterministic schema conversion logic.
- CLI command code handles flag parsing and delegates execution.
- Normalize remains artifact-level and does not compose merge pipeline modules.

## Default Modules

The default pipeline is equivalent to explicit module lists.