Document normalize command
60
README.md
@@ -1,8 +1,8 @@
 # seriatim
 
-`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID and normalizes external transcript-like JSON into standard seriatim output schemas.
 
-The current implementation supports the `merge` and `trim` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset.
+The current implementation supports the `merge`, `trim`, and `normalize` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset. `normalize` reads transcript-like JSON input, validates required segment fields, sorts deterministically, assigns fresh IDs, and emits a selected seriatim output schema.
 
 ## Usage
 
@@ -34,11 +34,30 @@ go run ./cmd/seriatim trim \
   --keep "1-10, 15, 20-25"
 ```
 
+Normalize external transcript-style JSON:
+
+```sh
+go run ./cmd/seriatim normalize \
+  --input-file transcript.json \
+  --output-file normalized.json
+```
+
+Normalize an Audita-style bare segment array to full schema with report output:
+
+```sh
+go run ./cmd/seriatim normalize \
+  --input-file audita-segments.json \
+  --output-file normalized-full.json \
+  --output-schema seriatim-full \
+  --report-file normalize-report.json
+```
+
 ## CLI
 
 ```text
 seriatim merge [flags]
 seriatim trim [flags]
+seriatim normalize [flags]
 ```
 
 Global flags:
@@ -108,6 +127,43 @@ Global flags:
 - The report includes a `trim-audit` event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
 - Old-to-new ID mapping is emitted as a deterministic ordered array of `{old_id, new_id}` pairs.
 
+`normalize` flags:
+
+| Flag | Required | Default | Description |
+| --- | --- | --- | --- |
+| `--input-file` | Yes | none | Input transcript JSON file. |
+| `--output-file` | Yes | none | Normalized transcript JSON output path. |
+| `--output-schema` | No | `seriatim-intermediate` (resolved via `SERIATIM_OUTPUT_SCHEMA` when set) | Output JSON schema: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`. |
+| `--output-modules` | No | `json` | Comma-separated output modules. Current normalize support is `json` only. |
+| `--report-file` | No | none | Optional report JSON output path. |
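
The `--output-schema` default described in the flag table can be pictured with a small Go sketch. This is an illustration only, not the project's code: `resolveOutputSchema` and `defaultOutputSchema` are hypothetical names, and the precedence shown (explicit flag, then `SERIATIM_OUTPUT_SCHEMA`, then the built-in default) is an assumption that the table implies but does not spell out.

```go
package main

import "fmt"

// defaultOutputSchema mirrors the default listed in the flag table.
const defaultOutputSchema = "seriatim-intermediate"

// validSchemas lists the three schema names the flag table documents.
var validSchemas = map[string]bool{
	"seriatim-minimal":      true,
	"seriatim-intermediate": true,
	"seriatim-full":         true,
}

// resolveOutputSchema is a hypothetical helper. It assumes this precedence:
// explicit flag value, then SERIATIM_OUTPUT_SCHEMA, then the built-in default.
// lookupEnv has the same signature as os.LookupEnv so it can be swapped in tests.
func resolveOutputSchema(flagValue string, lookupEnv func(string) (string, bool)) (string, error) {
	schema := defaultOutputSchema
	if v, ok := lookupEnv("SERIATIM_OUTPUT_SCHEMA"); ok && v != "" {
		schema = v
	}
	if flagValue != "" {
		schema = flagValue
	}
	if !validSchemas[schema] {
		return "", fmt.Errorf("unknown output schema %q", schema)
	}
	return schema, nil
}
```

In a real command, `os.LookupEnv` would be passed as the second argument; injecting the lookup keeps the resolution logic free of process state.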
+
+`normalize` input shapes:
+
+- Top-level object with a `segments` array.
+- Bare top-level array of segment objects (for example, Audita-style output).
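
One way to accept both input shapes is to inspect the first non-whitespace byte before decoding. The sketch below is a minimal illustration under assumptions: `Segment`, `decodeSegments`, and the shape labels `"object"` and `"bare-array"` are hypothetical names, not taken from the seriatim source, and the check for missing required fields is left out.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
)

// Segment holds the four fields that normalize requires on every segment.
type Segment struct {
	Start   float64 `json:"start"`
	End     float64 `json:"end"`
	Speaker string  `json:"speaker"`
	Text    string  `json:"text"`
}

// decodeSegments (hypothetical) returns a shape label and the decoded segments
// for either a top-level object with a "segments" array or a bare array.
func decodeSegments(data []byte) (string, []Segment, error) {
	trimmed := bytes.TrimSpace(data)
	if len(trimmed) == 0 {
		return "", nil, errors.New("empty input")
	}
	switch trimmed[0] {
	case '[':
		// Bare top-level array of segment objects.
		var segs []Segment
		if err := json.Unmarshal(trimmed, &segs); err != nil {
			return "", nil, err
		}
		return "bare-array", segs, nil
	case '{':
		// Top-level object that must carry a "segments" array.
		var wrapper struct {
			Segments []Segment `json:"segments"`
		}
		if err := json.Unmarshal(trimmed, &wrapper); err != nil {
			return "", nil, err
		}
		if wrapper.Segments == nil {
			return "", nil, errors.New("object has no segments array")
		}
		return "object", wrapper.Segments, nil
	default:
		return "", nil, errors.New("input must be a JSON object or array")
	}
}
```

Branching on the leading byte keeps shape detection deterministic: the same bytes always take the same path, and anything that is neither an object nor an array is rejected up front.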
+
+`normalize` required segment fields:
+
+- `start`
+- `end`
+- `speaker`
+- `text`
+
+`normalize` behavior:
+
+- Validates `start >= 0`, `end >= start`, and non-empty `speaker`.
+- Accepts existing input `id` values as provenance only.
+- Reassigns output segment IDs sequentially from `1` to `N`.
+- Sorts deterministically by `(start, end, original_input_index, speaker)`.
+- Uses original input order only as a tie-breaker.
+- Does not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
+- Useful for converting external transcript outputs into standard seriatim artifacts.
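
The validation, ordering, and ID rules listed above can be sketched in a few lines of Go. This is an illustration under assumptions rather than the code in `internal/normalize`: the type and function names are hypothetical, and "non-empty `speaker`" is read here as "not the empty string".

```go
package main

import (
	"fmt"
	"sort"
)

// inputSegment carries the required fields of one input segment.
type inputSegment struct {
	Start, End    float64
	Speaker, Text string
}

// outputSegment is an input segment plus its freshly assigned ID.
type outputSegment struct {
	ID int
	inputSegment
}

// normalizeSegments (hypothetical) validates, sorts, and renumbers segments.
func normalizeSegments(in []inputSegment) ([]outputSegment, error) {
	for i, s := range in {
		switch {
		case s.Start < 0:
			return nil, fmt.Errorf("segment %d: start %v is negative", i, s.Start)
		case s.End < s.Start:
			return nil, fmt.Errorf("segment %d: end %v precedes start %v", i, s.End, s.Start)
		case s.Speaker == "":
			return nil, fmt.Errorf("segment %d: speaker is empty", i)
		}
	}

	// Sort a slice of input indexes so the original position stays available
	// as a tie-breaker. Because indexes are unique, the trailing speaker key
	// in (start, end, original_input_index, speaker) never decides here.
	order := make([]int, len(in))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(a, b int) bool {
		x, y := in[order[a]], in[order[b]]
		if x.Start != y.Start {
			return x.Start < y.Start
		}
		if x.End != y.End {
			return x.End < y.End
		}
		return order[a] < order[b]
	})

	// Assign IDs 1..N only after sorting; input IDs are not consulted.
	out := make([]outputSegment, len(in))
	for i, idx := range order {
		out[i] = outputSegment{ID: i + 1, inputSegment: in[idx]}
	}
	return out, nil
}
```

Sorting indexes instead of the segments themselves makes the input-order tie-breaker explicit, so two segments with identical timing always come out in the order they were read.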
+
+`normalize` report output:
+
+- When `--report-file` is provided, normalize emits deterministic report events with input shape detection, segment counts, schema/module selections, sorting/ID diagnostics, and output write/validation summaries.
+- A machine-readable `normalize-audit` event is included for downstream tooling.
+
 Environment variables:
 
 | Environment Variable | Default | Description |
@@ -3,7 +3,8 @@
 `seriatim` is a deterministic transcript utility for:
 
 - merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
-- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming.
+- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming, and
+- canonicalizing external transcript-style JSON inputs into standard seriatim output schemas.
 
 The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.
 
@@ -60,7 +61,7 @@ configuration check
 
 Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.
 
-`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
+`merge` runs this pipeline. `trim` and `normalize` are intentionally separate from this pipeline and operate at the artifact layer.
 
 ## Stage Contracts
 
@@ -214,6 +215,24 @@ Design constraints:
 
 `trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.
 
+### 8. Artifact Canonicalization Stage (`normalize` command)
+
+`normalize` is an artifact-level command that reads transcript-like JSON and emits a standard seriatim output artifact in a selected schema.
+
+Design constraints:
+
+- `normalize` runs outside the merge pipeline and does not invoke merge preprocessing or postprocessing modules.
+- `normalize` accepts two input shapes: object-with-`segments` and bare segment arrays.
+- `normalize` validates required segment fields (`start`, `end`, `speaker`, `text`) and timing/speaker constraints.
+- `normalize` sorts segments deterministically by chronological keys and stable input-index tie-breakers.
+- `normalize` assigns fresh sequential output IDs (`1..N`) after sorting.
+- `normalize` validates final output against the selected schema before writing.
+- `normalize` writes optional deterministic report diagnostics when `--report-file` is requested.
+
+`normalize` is intended for canonicalizing external transcript outputs (including Audita-style bare arrays) into seriatim contracts, not for running merge-time language or overlap transformations.
+
+`normalize` must not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
+
 ## Module Classification
 
 Modules should be classified by their contract and allowed effects.
@@ -442,6 +461,13 @@ Trim-specific determinism requirements:
 - Old-to-new ID mapping in trim reports is emitted in deterministic order.
 - Full-schema overlap recomputation is deterministic for the same input artifact and selector.
 
+Normalize-specific determinism requirements:
+
+- Input-shape detection is deterministic.
+- Segment ordering is deterministic for identical input data.
+- Output IDs are always reassigned sequentially after deterministic sorting.
+- Normalize diagnostic reports are deterministic for identical inputs and configuration.
+
 ## Go Package Layout
 
 ```text
@@ -450,6 +476,7 @@ internal/config/ CLI/env/config loading and validation
 internal/pipeline/ Pipeline orchestration and module registry
 internal/builtin/ Built-in pipeline modules
 internal/artifact/ Conversion from internal model to public output schema
+internal/normalize/ Normalize input parsing, validation, deterministic sorting, schema conversion, and diagnostics
 internal/trim/ Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
 internal/buildinfo/ Build-time version metadata
 internal/speaker/ Speaker map parsing and lookup
@@ -468,6 +495,12 @@ For trim:
 - CLI command code handles only flag parsing, file I/O, and report emission.
 - Transform logic is deterministic and pure except for command-layer I/O.
 
+For normalize:
+
+- `internal/normalize` contains parsing/validation and deterministic schema conversion logic.
+- CLI command code handles flag parsing and delegates execution.
+- Normalize remains artifact-level and does not compose merge pipeline modules.
+
 ## Default Modules
 
 The default pipeline is equivalent to explicit module lists.