Document trim command

2026-05-08 14:57:52 +00:00
parent c48b02d2ec
commit 54f7717de8
2 changed files with 95 additions and 3 deletions

@@ -1,8 +1,8 @@
# seriatim
-`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.
-The current implementation supports the `merge` command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact.
+The current implementation supports the `merge` and `trim` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset.
## Usage
@@ -25,10 +25,20 @@ go run ./cmd/seriatim merge \
--report-file report.json
```
Trim an existing seriatim artifact:
```sh
go run ./cmd/seriatim trim \
--input-file merged.json \
--output-file trimmed.json \
--keep "1-10, 15, 20-25"
```
## CLI
```text
seriatim merge [flags]
seriatim trim [flags]
```
Global flags:
@@ -54,6 +64,50 @@ Global flags:
| `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
| `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |
`trim` flags:
| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| `--input-file` | Yes | none | Input seriatim output artifact JSON file. |
| `--output-file` | Yes | none | Trimmed transcript JSON output path. |
| `--keep` | Exactly one of `--keep` or `--remove` is required | none | Segment ID selector to retain. |
| `--remove` | Exactly one of `--keep` or `--remove` is required | none | Segment ID selector to drop. |
| `--output-schema` | No | preserve input artifact schema | Optional output schema override: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`. |
| `--report-file` | No | none | Optional report JSON output path. |
| `--allow-empty` | No | `false` | Allow trimming to zero retained segments. |
`trim` selection rules:
- `--keep` and `--remove` are mutually exclusive.
- Exactly one of `--keep` or `--remove` is required.
- Selection is by segment ID only.
- Invalid selected segment IDs fail the command by default.
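The selection rules above can be sketched as two small checks. This is an illustrative sketch only: `validateSelectionFlags` and `checkIDsExist` are hypothetical names, not seriatim's actual functions, and the second check assumes the input artifact's segment IDs run from `1` to the segment count.

```go
package main

import (
	"errors"
	"fmt"
)

// validateSelectionFlags is a hypothetical helper that enforces
// "exactly one of --keep or --remove".
func validateSelectionFlags(keep, remove string) error {
	switch {
	case keep != "" && remove != "":
		return errors.New("--keep and --remove are mutually exclusive")
	case keep == "" && remove == "":
		return errors.New("exactly one of --keep or --remove is required")
	}
	return nil
}

// checkIDsExist is a hypothetical helper that rejects selected IDs missing
// from the input. It assumes (not confirmed by the source) that input
// segment IDs are exactly 1..segmentCount.
func checkIDsExist(selected []int, segmentCount int) error {
	for _, id := range selected {
		if id < 1 || id > segmentCount {
			return fmt.Errorf("segment ID %d is not in the input artifact (1..%d)", id, segmentCount)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateSelectionFlags("1-10", ""))  // <nil>
	fmt.Println(validateSelectionFlags("1-10", "3")) // mutually exclusive error
	fmt.Println(checkIDsExist([]int{2, 9}, 5))       // ID 9 is out of range
}
```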
`trim` selector syntax:
- Segment IDs are positive 1-based integers.
- Inclusive ranges are supported: `1-10`.
- Comma-separated selectors are supported: `1-10,15,20-25`.
- Whitespace around numbers, commas, and hyphens is allowed: `1 - 10, 15, 20 - 25`.
- Duplicate and overlapping ranges are accepted and normalized as a union.
- Descending ranges (for example `10-1`) are rejected.
`trim` behavior:
- `trim` consumes existing seriatim JSON output artifacts only.
- `trim` does not accept raw WhisperX transcript JSON as input.
- Retained output segment IDs are renumbered sequentially from `1` to `N`.
- Transcript order is preserved from input transcript order; selector order does not reorder output.
- When output schema is `seriatim-full`, overlap groups are recomputed from retained segments.
- `--output-schema seriatim-full` is supported only when the input artifact provides the full-schema data to emit; `trim` does not synthesize missing full-schema provenance from minimal or intermediate input artifacts.
- `trim` does not run merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.
`trim` report output:
- When `--report-file` is provided, the report includes standard trim/validation/output events.
- The report includes a `trim-audit` event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
- Old-to-new ID mapping is emitted as a deterministic ordered array of `{old_id, new_id}` pairs.
Environment variables:
| Environment Variable | Default | Description |

@@ -1,6 +1,9 @@
# seriatim Architecture
-`seriatim` is a deterministic transcript merge utility for combining multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript.
+`seriatim` is a deterministic transcript utility for:
- merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming.
The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.
@@ -20,6 +23,7 @@ The initial use case is merging independently transcribed speaker audio tracks f
8. Detect and annotate overlapping speech regions.
9. Emit one or more output artifacts through output writers.
10. Produce report data for validation findings, corrections, and transformations.
11. Support artifact-level transcript projection commands that operate on existing seriatim output.
## Non-goals
@@ -56,6 +60,8 @@ configuration check
Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.
`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
## Stage Contracts
### 1. Configuration Check
@@ -191,6 +197,23 @@ Future output formats may include:
Output writers should be selected from an explicit registry and should consume the final transcript model read-only. Multiple output writers may run for a single invocation.
### 7. Artifact Projection Stage (`trim` command)
`trim` is an artifact-level command that reads an existing seriatim output artifact and emits a projected artifact containing a segment-ID subset.
Design constraints:
- `trim` runs after `merge`, not as a merge postprocessor.
- `trim` validates the input artifact against supported seriatim output schemas.
- `trim` performs deterministic keep/remove selection by segment ID.
- `trim` renumbers retained IDs to `1..N` in transcript order.
- `trim` validates the final output against the selected output schema before writing.
- `trim` records audit metadata in report output.
`trim` is intentionally separate from merge postprocessing because it consumes already-emitted public artifacts. This separation keeps merge semantics stable and avoids rerunning merge-only transforms on projected artifacts.
`trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.
## Module Classification
Modules should be classified by their contract and allowed effects.
@@ -397,6 +420,8 @@ A valid merged transcript should satisfy:
- Every referenced segment exists.
- Output validates against the selected output schema.
For full-schema trim output, overlap groups are recomputed from retained segments so overlap annotations and group references remain internally consistent after projection.
## Determinism Requirements
Given the same inputs, config, and application version, `seriatim` should produce byte-stable JSON output where practical.
@@ -411,6 +436,12 @@ To support this:
- Record application version in output metadata.
- Record enabled module names and module order in output metadata or report data.
Trim-specific determinism requirements:
- Selector normalization and retained IDs are deterministic.
- Old-to-new ID mapping in trim reports is emitted in deterministic order.
- Full-schema overlap recomputation is deterministic for the same input artifact and selector.
## Go Package Layout
```text
@@ -419,6 +450,7 @@ internal/config/ CLI/env/config loading and validation
internal/pipeline/ Pipeline orchestration and module registry
internal/builtin/ Built-in pipeline modules
internal/artifact/ Conversion from internal model to public output schema
internal/trim/ Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
internal/buildinfo/ Build-time version metadata
internal/speaker/ Speaker map parsing and lookup
internal/model/ Canonical and merged transcript models
@@ -430,6 +462,12 @@ schema/ Public output contract and JSON Schema validation
Package boundaries should follow data ownership. Shared models belong in `internal/model`; stage-specific behavior belongs in the relevant stage package.
For trim:
- `internal/trim` contains pure transformation logic over artifact structs.
- CLI command code handles only flag parsing, file I/O, and report emission.
- Transform logic is deterministic and pure except for command-layer I/O.
## Default Modules
The default pipeline is equivalent to explicit module lists.