Document trim command

2026-05-08 14:57:52 +00:00
parent c48b02d2ec
commit 54f7717de8
2 changed files with 95 additions and 3 deletions

@@ -1,6 +1,9 @@
# seriatim Architecture
`seriatim` is a deterministic transcript merge utility for combining multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript.
`seriatim` is a deterministic transcript utility for:
- merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
- producing trimmed projections of existing seriatim transcript artifacts through deterministic selection by segment ID.
The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.
@@ -20,6 +23,7 @@ The initial use case is merging independently transcribed speaker audio tracks f
8. Detect and annotate overlapping speech regions.
9. Emit one or more output artifacts through output writers.
10. Produce report data for validation findings, corrections, and transformations.
11. Support artifact-level transcript projection commands that operate on existing seriatim output.
## Non-goals
@@ -56,6 +60,8 @@ configuration check
Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.
`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
## Stage Contracts
### 1. Configuration Check
@@ -191,6 +197,23 @@ Future output formats may include:
Output writers should be selected from an explicit registry and should consume the final transcript model read-only. Multiple output writers may run for a single invocation.
### 7. Artifact Projection Stage (`trim` command)
`trim` is an artifact-level command that reads an existing seriatim output artifact and emits a projected artifact containing only a selected subset of its segments, identified by segment ID.
Design constraints:
- `trim` runs as a separate command on the output of `merge`, not as a merge postprocessor.
- `trim` validates the input artifact against supported seriatim output schemas.
- `trim` performs deterministic keep/remove selection by segment ID.
- `trim` renumbers retained IDs to `1..N` in transcript order.
- `trim` validates the final output against the selected output schema before writing.
- `trim` records audit metadata in report output.
`trim` is intentionally separate from merge postprocessing because it consumes already-emitted public artifacts. This separation keeps merge semantics stable and avoids rerunning merge-only transforms on projected artifacts.
`trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.
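The keep/remove selection and `1..N` renumbering constraints above can be illustrated with a minimal sketch. `Segment`, `Selector`, and `Project` are hypothetical names and simplified shapes, not the actual `internal/trim` API, and the sketch assumes the input segments are already in transcript order.

```go
package main

import "fmt"

// Segment is a hypothetical, simplified stand-in for a segment in a
// seriatim output artifact; the real schema has more fields.
type Segment struct {
	ID      int
	Speaker string
	Text    string
}

// Selector is a hypothetical selector: a set of segment IDs plus a mode.
type Selector struct {
	IDs  map[int]bool
	Keep bool // true: keep the listed IDs; false: remove the listed IDs
}

// Project returns the retained segments, renumbered 1..N in transcript
// order, together with the old-to-new ID mapping. It assumes segments are
// already in transcript order and does not modify the input slice.
func Project(segments []Segment, sel Selector) ([]Segment, map[int]int) {
	out := make([]Segment, 0, len(segments))
	mapping := make(map[int]int)
	for _, s := range segments {
		if sel.IDs[s.ID] != sel.Keep {
			continue // unlisted in keep mode, or listed in remove mode
		}
		oldID := s.ID
		s.ID = len(out) + 1 // s is a copy, so the input stays untouched
		mapping[oldID] = s.ID
		out = append(out, s)
	}
	return out, mapping
}

func main() {
	in := []Segment{
		{ID: 1, Speaker: "A", Text: "hello"},
		{ID: 2, Speaker: "B", Text: "hi"},
		{ID: 3, Speaker: "A", Text: "bye"},
	}
	// Remove segment 2; segments 1 and 3 become 1 and 2.
	out, mapping := Project(in, Selector{IDs: map[int]bool{2: true}, Keep: false})
	fmt.Println(len(out), mapping[1], mapping[3]) // prints "2 1 2"
}
```

Because the function only reads its arguments and returns new values, the same artifact and selector always yield the same retained segments and mapping.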
## Module Classification
Modules should be classified by their contract and allowed effects.
@@ -397,6 +420,8 @@ A valid merged transcript should satisfy:
- Every referenced segment exists.
- Output validates against the selected output schema.
For full-schema trim output, overlap groups are recomputed from retained segments so overlap annotations and group references remain internally consistent after projection.
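As a rough illustration of why recomputation is needed, the sketch below regroups segments by time overlap after projection. The grouping rule (transitive time overlap, with touching segments not counted as overlapping) and the `Segment`, `OverlapGroup`, and `RecomputeOverlaps` names are assumptions for this example; the actual overlap definition is whatever the merge pipeline and output schema specify.

```go
package main

import "fmt"

// Segment is a hypothetical, simplified segment with a time range in seconds.
type Segment struct {
	ID    int
	Start float64
	End   float64
}

// OverlapGroup is a hypothetical group of segments that overlap in time.
type OverlapGroup struct {
	ID         int
	SegmentIDs []int
}

// RecomputeOverlaps groups segments whose time ranges overlap transitively.
// It assumes segments are sorted by Start. Segments that merely touch
// (next.Start == group end) are treated as non-overlapping here.
func RecomputeOverlaps(segments []Segment) []OverlapGroup {
	var groups []OverlapGroup
	var current []int
	groupEnd := 0.0

	// flush closes the current run and records it if it has 2+ members.
	flush := func() {
		if len(current) > 1 {
			groups = append(groups, OverlapGroup{ID: len(groups) + 1, SegmentIDs: current})
		}
		current = nil
	}

	for _, s := range segments {
		if len(current) > 0 && s.Start >= groupEnd {
			flush()
		}
		current = append(current, s.ID)
		if len(current) == 1 || s.End > groupEnd {
			groupEnd = s.End
		}
	}
	flush()
	return groups
}

func main() {
	before := []Segment{{1, 0, 5}, {2, 4, 8}, {3, 10, 12}}
	// After trimming old segment 2 and renumbering, the group disappears.
	after := []Segment{{1, 0, 5}, {2, 10, 12}}
	fmt.Println(len(RecomputeOverlaps(before)), len(RecomputeOverlaps(after))) // prints "1 0"
}
```

In this example, copying the original group through unchanged would leave a reference to a segment that no longer exists, which is the inconsistency that recomputation avoids.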
## Determinism Requirements
Given the same inputs, config, and application version, `seriatim` should produce byte-stable JSON output where practical.
@@ -411,6 +436,12 @@ To support this:
- Record application version in output metadata.
- Record enabled module names and module order in output metadata or report data.
Trim-specific determinism requirements:
- Selector normalization and the resulting set of retained IDs are deterministic for the same input artifact and selector.
- Old-to-new ID mapping in trim reports is emitted in deterministic order.
- Full-schema overlap recomputation is deterministic for the same input artifact and selector.
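One way to satisfy the ordering requirement for the old-to-new ID mapping is to emit it as a list sorted by old ID rather than iterating a Go map, whose iteration order is not specified. The sketch below assumes a hypothetical `IDMapping` report entry and hypothetical JSON field names; the actual report schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// IDMapping is a hypothetical report entry; the field names are assumptions.
type IDMapping struct {
	Old int `json:"old_id"`
	New int `json:"new_id"`
}

// OrderedMappings converts an old-to-new ID map into a slice sorted by old
// ID, so the serialized report is identical across runs.
func OrderedMappings(m map[int]int) []IDMapping {
	olds := make([]int, 0, len(m))
	for old := range m {
		olds = append(olds, old)
	}
	sort.Ints(olds)

	out := make([]IDMapping, 0, len(olds))
	for _, old := range olds {
		out = append(out, IDMapping{Old: old, New: m[old]})
	}
	return out
}

func main() {
	b, err := json.Marshal(OrderedMappings(map[int]int{10: 3, 2: 1, 7: 2}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// prints [{"old_id":2,"new_id":1},{"old_id":7,"new_id":2},{"old_id":10,"new_id":3}]
}
```

Marshaling the map directly would also be stable, since `encoding/json` sorts map keys, but integer keys are then ordered as strings (`"10"` before `"2"`); an explicit sorted slice keeps numeric order.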
## Go Package Layout
```text
@@ -419,6 +450,7 @@ internal/config/ CLI/env/config loading and validation
internal/pipeline/ Pipeline orchestration and module registry
internal/builtin/ Built-in pipeline modules
internal/artifact/ Conversion from internal model to public output schema
internal/trim/ Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
internal/buildinfo/ Build-time version metadata
internal/speaker/ Speaker map parsing and lookup
internal/model/ Canonical and merged transcript models
@@ -430,6 +462,12 @@ schema/ Public output contract and JSON Schema validation
Package boundaries should follow data ownership. Shared models belong in `internal/model`; stage-specific behavior belongs in the relevant stage package.
For trim:
- `internal/trim` contains pure transformation logic over artifact structs.
- CLI command code handles only flag parsing, file I/O, and report emission.
- All side effects are confined to command-layer I/O; the transform itself is deterministic.
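The split between a pure transform and a thin command layer might look like the following sketch. `Artifact`, `Segment`, `Keep`, and `runTrim` are hypothetical names with simplified shapes, used only to show where the I/O boundary sits; schema validation and report emission are omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// Segment and Artifact are hypothetical, simplified artifact structs.
type Segment struct {
	ID   int    `json:"id"`
	Text string `json:"text"`
}

type Artifact struct {
	Segments []Segment `json:"segments"`
}

// Keep is the pure transform: no I/O, no clocks, no global state.
// It retains the listed IDs and renumbers them 1..N in transcript order.
func Keep(a Artifact, ids []int) Artifact {
	want := make(map[int]bool, len(ids))
	for _, id := range ids {
		want[id] = true
	}
	out := Artifact{Segments: []Segment{}}
	for _, s := range a.Segments {
		if want[s.ID] {
			s.ID = len(out.Segments) + 1
			out.Segments = append(out.Segments, s)
		}
	}
	return out
}

// runTrim is the command layer: decode input, call the transform, encode output.
func runTrim(in io.Reader, out io.Writer, ids []int) error {
	var a Artifact
	if err := json.NewDecoder(in).Decode(&a); err != nil {
		return fmt.Errorf("read artifact: %w", err)
	}
	return json.NewEncoder(out).Encode(Keep(a, ids))
}

func main() {
	in := strings.NewReader(`{"segments":[{"id":1,"text":"a"},{"id":2,"text":"b"}]}`)
	if err := runTrim(in, os.Stdout, []int{2}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// prints {"segments":[{"id":1,"text":"b"}]}
}
```

Keeping the transform free of I/O means it can be unit-tested with in-memory structs, while the command layer only needs tests for decoding, encoding, and error paths.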
## Default Modules
The default pipeline is equivalent to explicit module lists.