# seriatim

`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.

The current implementation supports the `merge` command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact.

## Usage

Run from source:

```sh
go run ./cmd/seriatim merge \
  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
  --input-file samples/raw/2026-04-19-Mike_Brown.json \
  --output-file merged.json
```

Optional report output:

```sh
go run ./cmd/seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --output-file merged.json \
  --report-file report.json
```

## CLI

```text
seriatim merge [flags]
```

Required flags for the default pipeline:

- `--input-file`: input transcript JSON file. Repeat once per speaker/input file.
- `--output-file`: merged transcript JSON output path.

Optional flags:

- `--report-file`: write a JSON report with pipeline events.
- `--speakers`: speaker map YAML file. When omitted, input file basenames are used as speaker labels.
- `--autocorrect`: autocorrect rules file. When omitted, the default `autocorrect` module no-ops.
- `--input-reader`: input reader module. Default: `json-files`.
- `--output-modules`: comma-separated output modules. Default: `json`.
- `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,backchannel,filler,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output`.
- `--coalesce-gap`: maximum same-speaker gap in seconds for `coalesce`. Default: `3.0`.

## Input JSON Format

Each input file must be valid JSON with a top-level `segments` array. The current parser accepts the WhisperX segment subset needed for merging:

```json
{
  "segments": [
    {
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "words": [
        {"word": "Hello", "start": 1.25, "end": 1.55, "score": 0.98},
        {"word": "there.", "start": 1.7, "end": 2.0}
      ]
    }
  ]
}
```

Required segment fields:

- `start`: number, must be `>= 0`.
- `end`: number, must be `>= start`.
- `text`: string.

Optional word fields:

- `words`: array of word timing objects.
- `words[].word`: string.
- `words[].start`: optional number, must be `>= 0` when present.
- `words[].end`: optional number, must be `>= start` when present with `start`.
- `words[].score`: optional number.
- `words[].speaker`: optional raw speaker label string.

Word-level timing is preserved internally for overlap resolution. If a word is missing `start` or `end`, seriatim keeps the word text, emits a warning in the optional report, and does not use that word as a timing anchor. Word timing is not emitted in the final JSON artifact.

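The field rules in this section can be sketched in Go as follows. This is a minimal illustration, not seriatim's actual parser: the `Segment` and `Word` types and the `validateSegment` and `isTimingAnchor` helpers are hypothetical names, and the sketch checks value ranges only, not whether required fields are present.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Word mirrors one entry of "words". Pointer fields distinguish a missing
// timestamp from a timestamp of 0. (Hypothetical type for illustration.)
type Word struct {
	Word  string   `json:"word"`
	Start *float64 `json:"start"`
	End   *float64 `json:"end"`
}

// Segment holds the required segment fields plus optional word timing.
type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
	Words []Word  `json:"words"`
}

// validateSegment applies the documented range rules to one segment.
func validateSegment(s Segment) error {
	if s.Start < 0 {
		return fmt.Errorf("start %v must be >= 0", s.Start)
	}
	if s.End < s.Start {
		return fmt.Errorf("end %v must be >= start %v", s.End, s.Start)
	}
	for i, w := range s.Words {
		if w.Start != nil && *w.Start < 0 {
			return fmt.Errorf("words[%d].start must be >= 0", i)
		}
		if w.Start != nil && w.End != nil && *w.End < *w.Start {
			return fmt.Errorf("words[%d].end must be >= start", i)
		}
	}
	return nil
}

// isTimingAnchor reports whether a word carries both timestamps and can
// therefore be used as a timing anchor.
func isTimingAnchor(w Word) bool {
	return w.Start != nil && w.End != nil
}

func main() {
	raw := `{"segments":[{"start":1.25,"end":3.5,"text":"Hello there.",
		"words":[{"word":"Hello","start":1.25,"end":1.55},{"word":"there."}]}]}`
	var doc struct {
		Segments []Segment `json:"segments"`
	}
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		panic(err)
	}
	for _, s := range doc.Segments {
		fmt.Println("valid:", validateSegment(s) == nil)
		for _, w := range s.Words {
			fmt.Printf("%q anchor=%v\n", w.Word, isTimingAnchor(w))
		}
	}
}
```

Using pointers for `start` and `end` is one way to tell an untimed word apart from a word that starts at `0`; other representations would work equally well.
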
## Speaker Map Format

`speakers.yml` maps input files to canonical speaker names using ordered substring rules. This file is optional. If `--speakers` is omitted, `seriatim` uses each input file basename as the segment speaker label.

```yaml
match:
  - speaker: "Eric Rakestraw"
    match:
      - "Eric_Rakestraw"
      - "Eric"

  - speaker: "Mike Brown"
    match:
      - "Mike_Brown"
      - "mb"
```

For each `--input-file`, `seriatim` takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.

For example, this input:

```text
samples/raw/2026-04-19-Eric_Rakestraw.json
```

matches this rule because the basename contains `Eric_Rakestraw`:

```yaml
- speaker: "Eric Rakestraw"
  match:
    - "Eric_Rakestraw"
```

Important details:

- Matching is against the input file basename, not the full path.
- Matching is case-insensitive.
- Rules are evaluated from first to last.
- Each rule must have a non-empty `speaker`.
- Each rule must have at least one non-empty `match` string.
- Duplicate speaker names are invalid.
- When a speaker map is supplied, every input file must match at least one rule or the command fails.

Removed old format:

```yaml
inputs:
  eric.json:
    speaker: "Eric Rakestraw"
```

The old `inputs:` direct mapping format is no longer supported.

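The matching rules in this section can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not seriatim's actual code: the `Rule` type and `matchSpeaker` function are hypothetical names, and YAML loading and rule validation are left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Rule is one entry of the speaker map: a canonical speaker name and the
// substrings that select it. (Hypothetical type for illustration.)
type Rule struct {
	Speaker string
	Match   []string
}

// matchSpeaker lowercases the basename of inputPath and returns the speaker
// of the first rule with a matching substring. The bool is false when no
// rule matches.
func matchSpeaker(rules []Rule, inputPath string) (string, bool) {
	base := strings.ToLower(filepath.Base(inputPath))
	for _, r := range rules {
		for _, m := range r.Match {
			if m != "" && strings.Contains(base, strings.ToLower(m)) {
				return r.Speaker, true
			}
		}
	}
	return "", false
}

func main() {
	rules := []Rule{
		{Speaker: "Eric Rakestraw", Match: []string{"Eric_Rakestraw", "Eric"}},
		{Speaker: "Mike Brown", Match: []string{"Mike_Brown", "mb"}},
	}
	for _, p := range []string{
		"samples/raw/2026-04-19-Eric_Rakestraw.json",
		"samples/raw/2026-04-19-Mike_Brown.json",
		"unknown.json",
	} {
		speaker, ok := matchSpeaker(rules, p)
		fmt.Printf("%s -> %q (matched=%v)\n", p, speaker, ok)
	}
}
```

Because matching is by substring, short patterns such as `mb` also match unrelated basenames that happen to contain them, so more specific patterns are safer when placed first.
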
## Output JSON Format

The merged output uses the current seriatim envelope:

```json
{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "coalesce", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
    {
      "id": 1,
      "source": "eric.json",
      "source_segment_index": 0,
      "speaker": "Eric Rakestraw",
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "overlap_group_id": 1
    },
    {
      "id": 2,
      "source": "eric.json",
      "source_ref": "word-run:1:1:1",
      "derived_from": ["eric.json#0"],
      "speaker": "Eric Rakestraw",
      "start": 2.0,
      "end": 2.5,
      "text": "Resolved word run",
      "categories": ["backchannel"]
    }
  ],
  "overlap_groups": [
    {
      "id": 1,
      "start": 1.25,
      "end": 4.0,
      "segments": ["eric.json#0", "mike.json#0"],
      "speakers": ["Eric Rakestraw", "Mike Brown"],
      "class": "unknown",
      "resolution": "unresolved"
    }
  ]
}
```

Segments are sorted deterministically by:

```text
(start, end, source, source_segment_index/source_ref, speaker)
```

Final segment IDs are assigned after sorting and start at `1`.

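The sort key and ID assignment described in this section can be sketched in Go as follows. This is an illustration only: the `Segment` type and the `less` and `sortAndNumber` helpers are hypothetical names, and the sketch covers only original segments that carry `source_segment_index`. How derived segments that use `source_ref` compare is not specified here and is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Segment carries the fields that take part in the sort key.
// (Hypothetical type for illustration.)
type Segment struct {
	ID                 int
	Start, End         float64
	Source             string
	SourceSegmentIndex int
	Speaker            string
}

// less compares two segments field by field in the documented key order.
func less(a, b Segment) bool {
	switch {
	case a.Start != b.Start:
		return a.Start < b.Start
	case a.End != b.End:
		return a.End < b.End
	case a.Source != b.Source:
		return a.Source < b.Source
	case a.SourceSegmentIndex != b.SourceSegmentIndex:
		return a.SourceSegmentIndex < b.SourceSegmentIndex
	default:
		return a.Speaker < b.Speaker
	}
}

// sortAndNumber sorts in place, then assigns consecutive IDs starting at 1.
func sortAndNumber(segs []Segment) {
	sort.SliceStable(segs, func(i, j int) bool { return less(segs[i], segs[j]) })
	for i := range segs {
		segs[i].ID = i + 1
	}
}

func main() {
	segs := []Segment{
		{Start: 2.0, End: 2.5, Source: "mike.json", SourceSegmentIndex: 1, Speaker: "Mike Brown"},
		{Start: 1.25, End: 3.5, Source: "mike.json", SourceSegmentIndex: 0, Speaker: "Mike Brown"},
		{Start: 1.25, End: 3.5, Source: "eric.json", SourceSegmentIndex: 0, Speaker: "Eric Rakestraw"},
	}
	sortAndNumber(segs)
	for _, s := range segs {
		fmt.Printf("%d %s#%d %.2f-%.2f\n", s.ID, s.Source, s.SourceSegmentIndex, s.Start, s.End)
	}
}
```

Breaking ties on `source` and the segment index keeps the order stable across runs even when two speakers start and stop at exactly the same timestamps.
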
## Overlap Detection

The default postprocessing pipeline detects overlapping segment groups.

Overlap behavior:

- A strict timing overlap is required: `next.start < current_group_end`.
- Segments that only touch at a boundary are not grouped.
- Groups require at least two distinct speakers.
- Transitive overlaps are grouped together.
- Segments in detected groups receive `overlap_group_id`.
- `overlap_groups[].segments` contains stable references in `source#source_segment_index` format.
- `class` is currently `unknown`.
- `resolution` is `unresolved` until `resolve-overlaps` replaces the group.

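The grouping rules above amount to a single sweep over segments sorted by start time. The Go sketch below illustrates that sweep under stated assumptions: `Segment`, `Group`, `detectOverlaps`, and `distinctSpeakers` are hypothetical names, the input is assumed to be already sorted, and the real module may differ in detail.

```go
package main

import "fmt"

// Segment is the subset of fields needed for detection; Ref uses the
// source#source_segment_index format. (Hypothetical type for illustration.)
type Segment struct {
	Ref        string
	Speaker    string
	Start, End float64
}

// Group mirrors one overlap_groups entry.
type Group struct {
	ID         int
	Start, End float64
	Segments   []string
	Speakers   []string
}

// distinctSpeakers returns speaker names in first-seen order.
func distinctSpeakers(segs []Segment) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range segs {
		if !seen[s.Speaker] {
			seen[s.Speaker] = true
			out = append(out, s.Speaker)
		}
	}
	return out
}

// detectOverlaps sweeps segments sorted by start. A segment joins the open
// candidate when next.start < current group end (strict), which also makes
// transitive overlaps land in the same group.
func detectOverlaps(segs []Segment) []Group {
	var groups []Group
	var cur []Segment
	curEnd := 0.0
	flush := func() {
		speakers := distinctSpeakers(cur)
		if len(speakers) >= 2 { // single-speaker candidates are discarded
			g := Group{ID: len(groups) + 1, Start: cur[0].Start, End: curEnd, Speakers: speakers}
			for _, s := range cur {
				g.Segments = append(g.Segments, s.Ref)
			}
			groups = append(groups, g)
		}
		cur = nil
	}
	for _, s := range segs {
		if len(cur) > 0 && s.Start < curEnd {
			cur = append(cur, s)
			if s.End > curEnd {
				curEnd = s.End
			}
			continue
		}
		flush()
		cur, curEnd = []Segment{s}, s.End
	}
	flush()
	return groups
}

func main() {
	groups := detectOverlaps([]Segment{
		{Ref: "eric.json#0", Speaker: "Eric Rakestraw", Start: 1.25, End: 3.5},
		{Ref: "mike.json#0", Speaker: "Mike Brown", Start: 2.0, End: 4.0},
		{Ref: "eric.json#1", Speaker: "Eric Rakestraw", Start: 4.0, End: 5.0}, // only touches
	})
	fmt.Printf("%+v\n", groups)
}
```

In the example, `eric.json#1` starts exactly where the group ends, so it is not pulled into the group.
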
## Overlap Resolution

The default postprocessing pipeline runs `detect-overlaps`, then `resolve-overlaps`, then `backchannel`, then `filler`, then `coalesce`, then a second `detect-overlaps` pass.

For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word timing to build smaller word-run replacement segments:

- Words are included when their interval intersects the overlap window: `word.end > group.start && word.start < group.end`.
- Untimed words are included in replacement text in original word order when nearby timed words create a replacement run.
- Untimed words do not affect replacement segment start/end times or word-run gap splitting.
- Words for the same speaker are merged into one run when the gap between adjacent words is no greater than `SERIATIM_OVERLAP_WORD_RUN_GAP`.
- The default word-run gap is `0.75` seconds.
- Set `SERIATIM_OVERLAP_WORD_RUN_GAP` to a positive number of seconds to override the default.
- Near-start replacement word runs are reordered so shorter segments come first when adjacent starts are within `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW`.
- The default word-run reorder window is `0.4` seconds.
- Set `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW` to a positive number of seconds to override the default.
- Replacement segment text is built by joining word text with single spaces.
- Replacement segments include `source_ref` and `derived_from`.
- Replacement segments omit `source_segment_index` because they are derived from one or more original segments.
- Resolved overlap groups are removed before the second detection pass.
- Replacement segments are left without `overlap_group_id` until the second detection pass annotates any remaining overlap.
- If a speaker has no usable word timing in a group, that speaker's original segment is kept.
- If no speakers in a group have usable word timing, the original group and annotations remain unchanged.

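The window test and gap-based run splitting above can be sketched in Go for a single speaker as follows. This is a simplified illustration: `Word`, `Run`, `buildRuns`, and `wordRunGap` are hypothetical names, untimed words and run reordering are left out, and falling back to the default on an invalid environment value is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Word is one timed word of a single speaker. (Hypothetical type.)
type Word struct {
	Text       string
	Start, End float64
}

// Run is one replacement word run.
type Run struct {
	Start, End float64
	Text       string
}

// wordRunGap reads SERIATIM_OVERLAP_WORD_RUN_GAP, falling back to 0.75
// when the variable is unset or not a positive number (assumed behavior).
func wordRunGap() float64 {
	v, err := strconv.ParseFloat(os.Getenv("SERIATIM_OVERLAP_WORD_RUN_GAP"), 64)
	if err != nil || !(v > 0) {
		return 0.75
	}
	return v
}

// buildRuns keeps words that intersect the overlap window and merges
// adjacent words into one run while the gap between them is <= maxGap.
func buildRuns(words []Word, groupStart, groupEnd, maxGap float64) []Run {
	var runs []Run
	for _, w := range words {
		if !(w.End > groupStart && w.Start < groupEnd) {
			continue // outside the overlap window
		}
		if n := len(runs); n > 0 && w.Start-runs[n-1].End <= maxGap {
			runs[n-1].Text += " " + w.Text // join with a single space
			if w.End > runs[n-1].End {
				runs[n-1].End = w.End
			}
			continue
		}
		runs = append(runs, Run{Start: w.Start, End: w.End, Text: w.Text})
	}
	return runs
}

func main() {
	words := []Word{
		{Text: "Hello", Start: 1.25, End: 1.55},
		{Text: "there.", Start: 1.7, End: 2.0},
		{Text: "Right", Start: 3.2, End: 3.5},
		{Text: "later", Start: 5.0, End: 5.2},
	}
	for _, r := range buildRuns(words, 1.0, 4.0, wordRunGap()) {
		fmt.Printf("%.2f-%.2f %q\n", r.Start, r.End, r.Text)
	}
}
```

With the default gap, "Hello there." forms one run, "Right" starts a second run because it follows a 1.2 second pause, and "later" is dropped because it lies outside the overlap window.
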
## Backchannels

The default pipeline runs `backchannel` before `coalesce`. It tags short acknowledgement segments with:

```json
"categories": ["backchannel"]
```

Backchannel matching is case-insensitive, trims surrounding whitespace, and requires a matching acknowledgement phrase, no more than three whitespace-delimited words, and duration no greater than `1.0` second.

## Fillers

The default pipeline runs `filler` after `backchannel` and before `coalesce`. It tags short filler utterances with:

```json
"categories": ["filler"]
```

Filler matching is case-insensitive, trims surrounding whitespace, and requires only filler tokens such as `um`, `uh`, `er`, `erm`, `ah`, `eh`, `hmm`, `mm`, or repeated combinations of those tokens. Matching segments must contain no more than three whitespace-delimited words and have duration no greater than `1.0` second.

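The filler test described in this section can be sketched in Go as follows. It is an illustration only: `isFiller` and `fillerTokens` are hypothetical names, the token set contains just the examples listed above, and punctuation handling (for example `Um.`) is not covered because it is not specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// fillerTokens holds the example tokens named in this section; the real
// list may be longer.
var fillerTokens = map[string]bool{
	"um": true, "uh": true, "er": true, "erm": true,
	"ah": true, "eh": true, "hmm": true, "mm": true,
}

// isFiller reports whether a segment consists only of filler tokens, has at
// most three whitespace-delimited words, and lasts no more than 1.0 second.
func isFiller(text string, start, end float64) bool {
	// strings.Fields also discards leading and trailing whitespace.
	tokens := strings.Fields(strings.ToLower(text))
	if len(tokens) == 0 || len(tokens) > 3 || end-start > 1.0 {
		return false
	}
	for _, t := range tokens {
		if !fillerTokens[t] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isFiller("  Um uh ", 2.0, 2.5))   // short, fillers only
	fmt.Println(isFiller("um so", 2.0, 2.5))      // contains a non-filler word
	fmt.Println(isFiller("um", 2.0, 3.5))         // too long
	fmt.Println(isFiller("um um um um", 2.0, 2.5)) // too many words
}
```

A backchannel check would follow the same shape, with a phrase list in place of the token set.
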
## Coalescing

The default pipeline runs `coalesce` before the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.

Coalesced segments use `source_ref` values such as `coalesce:1`, include `derived_from`, and omit `source_segment_index`.

Different-speaker backchannel and filler segments do not block coalescing of surrounding same-speaker segments. Same-speaker backchannel and filler segments are merged normally when they are within `--coalesce-gap`. When same-speaker segments are coalesced, any `backchannel` or `filler` category from the merged inputs is dropped from the coalesced segment.

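The basic adjacency rule can be sketched in Go as follows. This is a simplified illustration: `Segment`, `coalesce`, and `mergeCategories` are hypothetical names, joining text with a single space is an assumption, and the sketch omits `source_ref`/`derived_from` bookkeeping as well as the rule that lets coalescing skip over different-speaker backchannel and filler segments.

```go
package main

import "fmt"

// Segment is the subset of fields needed for coalescing.
// (Hypothetical type for illustration.)
type Segment struct {
	Speaker    string
	Start, End float64
	Text       string
	Categories []string
}

// mergeCategories combines two category lists and drops the backchannel
// and filler tags, which do not survive coalescing.
func mergeCategories(a, b []string) []string {
	var out []string
	for _, list := range [][]string{a, b} {
		for _, c := range list {
			if c != "backchannel" && c != "filler" {
				out = append(out, c)
			}
		}
	}
	return out
}

// coalesce merges each segment into the previous output segment when both
// have the same speaker and next.start - current.end <= gap.
func coalesce(segs []Segment, gap float64) []Segment {
	var out []Segment
	for _, s := range segs {
		if n := len(out); n > 0 && out[n-1].Speaker == s.Speaker && s.Start-out[n-1].End <= gap {
			prev := &out[n-1]
			prev.Text += " " + s.Text // assumed join rule
			if s.End > prev.End {
				prev.End = s.End
			}
			prev.Categories = mergeCategories(prev.Categories, s.Categories)
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	segs := []Segment{
		{Speaker: "Eric Rakestraw", Start: 0, End: 1, Text: "Hello"},
		{Speaker: "Eric Rakestraw", Start: 2, End: 3, Text: "um", Categories: []string{"filler"}},
		{Speaker: "Mike Brown", Start: 3.5, End: 4, Text: "Hi."},
		{Speaker: "Eric Rakestraw", Start: 10, End: 11, Text: "Anyway."},
	}
	for _, s := range coalesce(segs, 3.0) {
		fmt.Printf("%s %.1f-%.1f %q %v\n", s.Speaker, s.Start, s.End, s.Text, s.Categories)
	}
}
```

Here the first two segments merge because the gap is 1 second, and the `filler` tag is dropped from the merged result.
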
## Autocorrect

Autocorrect is included in the default postprocessing pipeline. If `--autocorrect` is omitted, the module leaves transcript text unchanged and records a skip event in the optional report.

Enable corrections by passing `--autocorrect`:

```sh
go run ./cmd/seriatim merge \
  --input-file input.json \
  --autocorrect autocorrect.yml \
  --output-file merged.json
```

`autocorrect.yml` format:

```yaml
autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"

  - target: "Mike Brown"
    match:
      - "Mike Pat"
```

Matching behavior:

- Matching is case-sensitive.
- Matches apply only to whole tokens, not substrings inside larger words.
- Punctuation and whitespace can surround a match.
- Multi-word and hyphenated matches are supported.
- Duplicate match strings are invalid, including duplicates across separate rules.

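One way to approximate the matching behavior above in Go is a word-boundary regular expression per match string. This is a sketch under stated assumptions, not the module's actual implementation: `Rule` and `applyRules` are hypothetical names, token boundaries are approximated with the ASCII `\b` assertion from Go's `regexp` package, and rules are applied one after another in file order.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule mirrors one autocorrect entry: a target and the strings it replaces.
// (Hypothetical type for illustration.)
type Rule struct {
	Target string
	Match  []string
}

// applyRules replaces every whole-token occurrence of each match string
// with its target. Matching is case-sensitive because no (?i) flag is set.
func applyRules(text string, rules []Rule) string {
	for _, r := range rules {
		for _, m := range r.Match {
			// QuoteMeta keeps characters such as "." literal; \b stops
			// the pattern from matching inside a larger word.
			re := regexp.MustCompile(`\b` + regexp.QuoteMeta(m) + `\b`)
			text = re.ReplaceAllLiteralString(text, r.Target)
		}
	}
	return text
}

func main() {
	rules := []Rule{
		{Target: "Hrank", Match: []string{"hrank", "Frank"}},
		{Target: "Mike Brown", Match: []string{"Mike Pat"}},
	}
	fmt.Println(applyRules("Frank shrank, said Mike Pat.", rules))
}
```

In the example, `hrank` does not fire inside `shrank` because there is no word boundary between `s` and `h`, while `Frank` and the multi-word `Mike Pat` are replaced even though punctuation follows them.
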
## Current Limitations

- Only JSON input is supported.
- Overlap resolution depends on WhisperX word timing; groups without usable word timing remain unresolved.
- Alternate output formats are not implemented yet.