Document trim command
58
README.md
@@ -1,8 +1,8 @@
 # seriatim

-`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.

-The current implementation supports the `merge` command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact.
+The current implementation supports the `merge` and `trim` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset.

 ## Usage

@@ -25,10 +25,20 @@ go run ./cmd/seriatim merge \
 --report-file report.json
 ```

+Trim an existing seriatim artifact:
+
+```sh
+go run ./cmd/seriatim trim \
+--input-file merged.json \
+--output-file trimmed.json \
+--keep "1-10, 15, 20-25"
+```
+
 ## CLI

 ```text
 seriatim merge [flags]
+seriatim trim [flags]
 ```

 Global flags:
@@ -54,6 +64,50 @@ Global flags:
 | `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
 | `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |

+`trim` flags:
+
+| Flag | Required | Default | Description |
+| --- | --- | --- | --- |
+| `--input-file` | Yes | none | Input seriatim output artifact JSON file. |
+| `--output-file` | Yes | none | Trimmed transcript JSON output path. |
+| `--keep` | Exactly one of `--keep` or `--remove` is required | none | Segment ID selector to retain. |
+| `--remove` | Exactly one of `--keep` or `--remove` is required | none | Segment ID selector to drop. |
+| `--output-schema` | No | preserve input artifact schema | Optional output schema override: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`. |
+| `--report-file` | No | none | Optional report JSON output path. |
+| `--allow-empty` | No | `false` | Allow trimming to zero retained segments. |
+
+`trim` selection rules:
+
+- `--keep` and `--remove` are mutually exclusive.
+- Exactly one of `--keep` or `--remove` is required.
+- Selection is by segment ID only.
+- Invalid selected segment IDs fail the command by default.
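
Reviewer sketch, not part of the commit: the selection rules added here amount to an exactly-one-of check on two flags. The Go below is a minimal illustration, assuming the flag values arrive as plain strings where an empty string means "unset". `validateSelection` is a hypothetical helper name and does not come from the seriatim source.

```go
package main

import (
	"errors"
	"fmt"
)

// validateSelection is a hypothetical helper that enforces the
// "exactly one of --keep or --remove" rule. It returns the selector
// to use and whether that selector names segments to keep.
func validateSelection(keep, remove string) (selector string, isKeep bool, err error) {
	switch {
	case keep != "" && remove != "":
		return "", false, errors.New("--keep and --remove are mutually exclusive")
	case keep == "" && remove == "":
		return "", false, errors.New("exactly one of --keep or --remove is required")
	case keep != "":
		return keep, true, nil
	default:
		return remove, false, nil
	}
}

func main() {
	sel, isKeep, err := validateSelection("1-10, 15", "")
	fmt.Println(sel, isKeep, err)
}
```

Returning the chosen selector together with a keep/remove flag lets later steps share one code path for both modes.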
+
+`trim` selector syntax:
+
+- Segment IDs are positive 1-based integers.
+- Inclusive ranges are supported: `1-10`.
+- Comma-separated selectors are supported: `1-10,15,20-25`.
+- Whitespace around numbers, commas, and hyphens is allowed: `1 - 10, 15, 20 - 25`.
+- Duplicate and overlapping ranges are accepted and normalized as a union.
+- Descending ranges (for example `10-1`) are rejected.
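
Reviewer sketch, not part of the commit: the selector grammar described in these added lines is small enough to illustrate in a few lines of Go. This is one possible parser written from the documented rules only. `parseSelector` and `parseID` are hypothetical names, and the real `internal/trim` code may normalize selectors differently (for example as merged ranges rather than an expanded ID list).

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseSelector turns a selector such as "1-10, 15, 20-25" into a sorted,
// de-duplicated list of segment IDs (the union of all terms).
func parseSelector(s string) ([]int, error) {
	seen := map[int]bool{}
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty selector term in %q", s)
		}
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := parseID(lo)
		if err != nil {
			return nil, err
		}
		end := start
		if isRange {
			if end, err = parseID(hi); err != nil {
				return nil, err
			}
		}
		if end < start {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for id := start; id <= end; id++ {
			seen[id] = true
		}
	}
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids) // map iteration order is random, so sort for determinism
	return ids, nil
}

// parseID parses one positive, 1-based segment ID, ignoring surrounding spaces.
func parseID(s string) (int, error) {
	id, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil || id < 1 {
		return 0, fmt.Errorf("invalid segment ID %q", s)
	}
	return id, nil
}

func main() {
	ids, err := parseSelector("1 - 3, 2, 7")
	fmt.Println(ids, err) // [1 2 3 7] <nil>
}
```

Expanding ranges into a set keeps the sketch short. A very wide range would allocate one map entry per ID, so an implementation that cares about that would merge intervals instead.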
+
+`trim` behavior:
+
+- `trim` consumes existing seriatim JSON output artifacts only.
+- `trim` does not accept raw WhisperX transcript JSON as input.
+- Retained output segment IDs are renumbered sequentially from `1` to `N`.
+- Transcript order is preserved from input transcript order; selector order does not reorder output.
+- When output schema is `seriatim-full`, overlap groups are recomputed from retained segments.
+- `--output-schema seriatim-full` is supported when trim has full-schema artifact data to emit; trim does not synthesize missing full-schema provenance from minimal/intermediate input artifacts.
+- `trim` does not run merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.
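
Reviewer sketch, not part of the commit: the renumbering and ordering rules in these added lines can be illustrated as a single pass over the input segments. The `Segment` and `IDPair` types and the `trimSegments` function are hypothetical stand-ins, since the diff does not show the artifact structs. The sketch also ignores `--allow-empty` and schema handling.

```go
package main

import "fmt"

// Segment is a hypothetical stand-in for a seriatim artifact segment.
type Segment struct {
	ID   int
	Text string
}

// IDPair records how one retained segment was renumbered.
type IDPair struct {
	OldID int
	NewID int
}

// trimSegments keeps (keep=true) or removes (keep=false) the selected IDs,
// preserves input transcript order, and renumbers retained segments 1..N.
// Selecting an ID that is not present in the input is an error.
func trimSegments(segs []Segment, selected []int, keep bool) ([]Segment, []IDPair, error) {
	known := make(map[int]bool, len(segs))
	for _, s := range segs {
		known[s.ID] = true
	}
	sel := make(map[int]bool, len(selected))
	for _, id := range selected {
		if !known[id] {
			return nil, nil, fmt.Errorf("segment ID %d not found in input", id)
		}
		sel[id] = true
	}
	var out []Segment
	var mapping []IDPair
	for _, s := range segs {
		if sel[s.ID] != keep {
			continue // dropped: unselected in keep mode, selected in remove mode
		}
		newID := len(out) + 1
		mapping = append(mapping, IDPair{OldID: s.ID, NewID: newID})
		s.ID = newID // s is a copy, so the input slice is left untouched
		out = append(out, s)
	}
	return out, mapping, nil
}

func main() {
	segs := []Segment{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}, {5, "e"}}
	out, mapping, err := trimSegments(segs, []int{4, 2}, true)
	fmt.Println(out, mapping, err)
}
```

Because the loop walks the input in transcript order, the selector order (`4, 2` in the usage above) has no effect on the output order, which matches the rule that selectors do not reorder the transcript.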
+
+`trim` report output:
+
+- When `--report-file` is provided, the report includes standard trim/validation/output events.
+- The report includes a `trim-audit` event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
+- Old-to-new ID mapping is emitted as a deterministic ordered array of `{old_id, new_id}` pairs.
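
Reviewer sketch, not part of the commit: one way to see why the mapping is an ordered array rather than a JSON object. Only the event name `trim-audit` and the `old_id`/`new_id` keys come from the text above. Every other field name in this Go sketch (`event`, `selected_ids`, `retained_count`, `removed_count`, `removed_ids`, `id_map`) is an assumption made for illustration, not the actual report schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// idPair uses the key names given in the README text.
type idPair struct {
	OldID int `json:"old_id"`
	NewID int `json:"new_id"`
}

// trimAudit is a hypothetical shape for the trim-audit event payload.
type trimAudit struct {
	Event         string   `json:"event"`
	SelectedIDs   []int    `json:"selected_ids"`
	RetainedCount int      `json:"retained_count"`
	RemovedCount  int      `json:"removed_count"`
	RemovedIDs    []int    `json:"removed_ids"`
	IDMap         []idPair `json:"id_map"` // slice, so order is exactly as built
}

// encodeAudit serializes the event; struct fields are emitted in declaration order.
func encodeAudit(a trimAudit) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeAudit(trimAudit{
		Event:         "trim-audit",
		SelectedIDs:   []int{2, 4},
		RetainedCount: 2,
		RemovedCount:  3,
		RemovedIDs:    []int{1, 3, 5},
		IDMap:         []idPair{{2, 1}, {4, 2}},
	})
	fmt.Println(out, err)
}
```

A `map[int]int` would also serialize deterministically with `encoding/json`, but its keys are sorted as strings, so `10` would come before `2`. A slice of pairs keeps transcript order.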
+
 Environment variables:

 | Environment Variable | Default | Description |

@@ -1,6 +1,9 @@
 # seriatim Architecture

-`seriatim` is a deterministic transcript merge utility for combining multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript.
+`seriatim` is a deterministic transcript utility for:
+
+- merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
+- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming.

 The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.

@@ -20,6 +23,7 @@ The initial use case is merging independently transcribed speaker audio tracks f
 8. Detect and annotate overlapping speech regions.
 9. Emit one or more output artifacts through output writers.
 10. Produce report data for validation findings, corrections, and transformations.
+11. Support artifact-level transcript projection commands that operate on existing seriatim output.

 ## Non-goals

@@ -56,6 +60,8 @@ configuration check

 Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.

+`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
+
 ## Stage Contracts

 ### 1. Configuration Check
@@ -191,6 +197,23 @@ Future output formats may include:

 Output writers should be selected from an explicit registry and should consume the final transcript model read-only. Multiple output writers may run for a single invocation.

+### 7. Artifact Projection Stage (`trim` command)
+
+`trim` is an artifact-level command that reads an existing seriatim output artifact and emits a projected artifact containing a segment-ID subset.
+
+Design constraints:
+
+- `trim` runs after `merge`, not as a merge postprocessor.
+- `trim` validates the input artifact against supported seriatim output schemas.
+- `trim` performs deterministic keep/remove selection by segment ID.
+- `trim` renumbers retained IDs to `1..N` in transcript order.
+- `trim` validates the final output against the selected output schema before writing.
+- `trim` records audit metadata in report output.
+
+`trim` is intentionally separate from merge postprocessing because it consumes already-emitted public artifacts. This separation keeps merge semantics stable and avoids rerunning merge-only transforms on projected artifacts.
+
+`trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.
+
 ## Module Classification

 Modules should be classified by their contract and allowed effects.
@@ -397,6 +420,8 @@ A valid merged transcript should satisfy:
 - Every referenced segment exists.
 - Output validates against the selected output schema.

+For full-schema trim output, overlap groups are recomputed from retained segments so overlap annotations and group references remain internally consistent after projection.
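
Reviewer sketch, not part of the commit: the diff does not say how overlap groups are defined, so the Go below only illustrates one plausible reading of "recomputed from retained segments". It assumes each segment carries `Start` and `End` times in seconds, that segments are already sorted by start time, and that an overlap group is a maximal chain of two or more segments whose time ranges intersect. `Segment` and `overlapGroups` are hypothetical names.

```go
package main

import "fmt"

// Segment is a hypothetical stand-in carrying only what the sketch needs.
type Segment struct {
	ID         int
	Start, End float64
}

// overlapGroups returns the IDs of each maximal chain of overlapping
// segments. Input must be sorted by Start. Segments that overlap nothing
// produce no group.
func overlapGroups(segs []Segment) [][]int {
	var groups [][]int
	var cur []int
	maxEnd := 0.0
	for _, s := range segs {
		if len(cur) > 0 && s.Start < maxEnd {
			cur = append(cur, s.ID) // starts before the chain ends: same group
		} else {
			if len(cur) > 1 {
				groups = append(groups, cur)
			}
			cur = []int{s.ID} // start a new candidate chain
			maxEnd = s.End
		}
		if s.End > maxEnd {
			maxEnd = s.End
		}
	}
	if len(cur) > 1 {
		groups = append(groups, cur)
	}
	return groups
}

func main() {
	segs := []Segment{{1, 0, 2}, {2, 1, 3}, {3, 5, 6}, {4, 5.5, 7}, {5, 10, 11}}
	fmt.Println(overlapGroups(segs)) // [[1 2] [3 4]]
}
```

Under these assumptions, removing a segment can dissolve a group entirely, which is why copying the input artifact's groups through a trim would leave dangling references and recomputation is the safer choice.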
+
 ## Determinism Requirements

 Given the same inputs, config, and application version, `seriatim` should produce byte-stable JSON output where practical.
@@ -411,6 +436,12 @@ To support this:
 - Record application version in output metadata.
 - Record enabled module names and module order in output metadata or report data.

+Trim-specific determinism requirements:
+
+- Selector normalization and retained IDs are deterministic.
+- Old-to-new ID mapping in trim reports is emitted in deterministic order.
+- Full-schema overlap recomputation is deterministic for the same input artifact and selector.
+
 ## Go Package Layout

 ```text
@@ -419,6 +450,7 @@ internal/config/ CLI/env/config loading and validation
 internal/pipeline/   Pipeline orchestration and module registry
 internal/builtin/    Built-in pipeline modules
 internal/artifact/   Conversion from internal model to public output schema
+internal/trim/       Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
 internal/buildinfo/  Build-time version metadata
 internal/speaker/    Speaker map parsing and lookup
 internal/model/      Canonical and merged transcript models
@@ -430,6 +462,12 @@ schema/ Public output contract and JSON Schema validation

 Package boundaries should follow data ownership. Shared models belong in `internal/model`; stage-specific behavior belongs in the relevant stage package.

+For trim:
+
+- `internal/trim` contains pure transformation logic over artifact structs.
+- CLI command code handles only flag parsing, file I/O, and report emission.
+- Transform logic is deterministic and pure except for command-layer I/O.
+
 ## Default Modules

 The default pipeline is equivalent to explicit module lists.