seriatim
seriatim merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.
The current implementation supports the merge command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using speakers.yml, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric id values, and writes a merged JSON artifact.
Usage
Run from source:
go run ./cmd/seriatim merge \
--input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
--input-file samples/raw/2026-04-19-Mike_Brown.json \
--output-file merged.json
Optional report output:
go run ./cmd/seriatim merge \
--input-file eric.json \
--input-file mike.json \
--output-file merged.json \
--report-file report.json
CLI
seriatim merge [flags]
Required flags for the default pipeline:
- --input-file: input transcript JSON file. Repeat once per speaker/input file.
- --output-file: merged transcript JSON output path.
Optional flags:
- --report-file: write a JSON report with pipeline events.
- --speakers: speaker map YAML file. When omitted, input file basenames are used as speaker labels.
- --autocorrect: autocorrect rules file. When omitted, the default autocorrect module no-ops.
- --input-reader: input reader module. Default: json-files.
- --output-modules: comma-separated output modules. Default: json.
- --preprocessing-modules: comma-separated preprocessing modules. Default: validate-raw,normalize-speakers,trim-text.
- --postprocessing-modules: comma-separated postprocessing modules. Default: detect-overlaps,resolve-overlaps,backchannel,filler,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output.
- --coalesce-gap: maximum same-speaker gap in seconds for coalesce. Default: 3.0.
Input JSON Format
Each input file must be valid JSON with a top-level segments array. The current parser accepts the WhisperX segment subset needed for merging:
{
  "segments": [
    {
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "words": [
        {"word": "Hello", "start": 1.25, "end": 1.55, "score": 0.98},
        {"word": "there.", "start": 1.7, "end": 2.0}
      ]
    }
  ]
}
Required segment fields:
- start: number, must be >= 0.
- end: number, must be >= start.
- text: string.
Optional word fields:
- words: array of word timing objects.
- words[].word: string.
- words[].start: optional number, must be >= 0 when present.
- words[].end: optional number, must be >= start when present with start.
- words[].score: optional number.
- words[].speaker: optional raw speaker label string.
Word-level timing is preserved internally for overlap resolution. If a word is missing start or end, seriatim keeps the word text, emits a warning in the optional report, and does not use that word as a timing anchor. Word timing is not emitted in the final JSON artifact.
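The field rules above can be sketched as Go types plus a small validator. This is an illustration only, not seriatim's actual parser: the type names and parseTranscript are hypothetical, and a missing required field decodes as a zero value here rather than being rejected.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Word mirrors the optional word fields. Pointer fields distinguish
// "absent" from a legitimate zero timestamp.
type Word struct {
	Word    string   `json:"word"`
	Start   *float64 `json:"start"`
	End     *float64 `json:"end"`
	Score   *float64 `json:"score"`
	Speaker string   `json:"speaker"`
}

// Segment holds the required segment fields plus optional words.
type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
	Words []Word  `json:"words"`
}

// Transcript is the top-level input envelope.
type Transcript struct {
	Segments []Segment `json:"segments"`
}

// parseTranscript (hypothetical name) decodes one input document and applies
// the documented constraints. Words missing start or end yield warnings.
func parseTranscript(data []byte) (Transcript, []string, error) {
	var tr Transcript
	if err := json.Unmarshal(data, &tr); err != nil {
		return tr, nil, err
	}
	var warnings []string
	for i, s := range tr.Segments {
		if s.Start < 0 || s.End < s.Start {
			return tr, nil, fmt.Errorf("segment %d: invalid start/end", i)
		}
		for j, w := range s.Words {
			if w.Start != nil && *w.Start < 0 {
				return tr, nil, fmt.Errorf("segment %d word %d: negative start", i, j)
			}
			if w.Start != nil && w.End != nil && *w.End < *w.Start {
				return tr, nil, fmt.Errorf("segment %d word %d: end before start", i, j)
			}
			if w.Start == nil || w.End == nil {
				warnings = append(warnings, fmt.Sprintf("segment %d word %d: missing timing", i, j))
			}
		}
	}
	return tr, warnings, nil
}

func main() {
	input := []byte(`{"segments":[{"start":1.25,"end":3.5,"text":"Hello there.",
		"words":[{"word":"Hello","start":1.25,"end":1.55,"score":0.98},{"word":"there."}]}]}`)
	tr, warnings, err := parseTranscript(input)
	fmt.Println(len(tr.Segments), warnings, err)
}
```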
Speaker Map Format
speakers.yml maps input files to canonical speaker names using ordered substring rules. The file is optional: if --speakers is omitted, seriatim uses each input file basename as the segment speaker label.
Example:
match:
  - speaker: "Eric Rakestraw"
    match:
      - "Eric_Rakestraw"
      - "Eric"
  - speaker: "Mike Brown"
    match:
      - "Mike_Brown"
      - "mb"
For each --input-file, seriatim takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.
For example, this input:
samples/raw/2026-04-19-Eric_Rakestraw.json
matches this rule because the basename contains Eric_Rakestraw:
- speaker: "Eric Rakestraw"
  match:
    - "Eric_Rakestraw"
Important details:
- Matching is against the input file basename, not the full path.
- Matching is case-insensitive.
- Rules are evaluated from first to last.
- Each rule must have a non-empty speaker.
- Each rule must have at least one non-empty match string.
- Duplicate speaker names are invalid.
- Every input file must match at least one rule or the command fails.
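The matching rules above can be sketched in a few lines of Go. SpeakerRule and speakerFor are hypothetical names chosen for illustration, and the sketch assumes the rules have already been loaded and validated (for example, no empty match strings).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// SpeakerRule is a hypothetical in-memory form of one speakers.yml rule.
type SpeakerRule struct {
	Speaker string
	Match   []string
}

// speakerFor returns the speaker of the first rule with a match string that
// occurs in the input file's basename, compared case-insensitively.
// Later rules are not evaluated once a rule matches.
func speakerFor(inputPath string, rules []SpeakerRule) (string, error) {
	base := strings.ToLower(filepath.Base(inputPath))
	for _, rule := range rules {
		for _, m := range rule.Match {
			if strings.Contains(base, strings.ToLower(m)) {
				return rule.Speaker, nil
			}
		}
	}
	return "", fmt.Errorf("no speaker rule matches %q", filepath.Base(inputPath))
}

func main() {
	rules := []SpeakerRule{
		{Speaker: "Eric Rakestraw", Match: []string{"Eric_Rakestraw", "Eric"}},
		{Speaker: "Mike Brown", Match: []string{"Mike_Brown", "mb"}},
	}
	name, err := speakerFor("samples/raw/2026-04-19-Eric_Rakestraw.json", rules)
	fmt.Println(name, err)
}
```

Because the first matching rule wins, broad substrings such as "mb" are safest in later rules, after the more specific ones.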
Unsupported legacy format:
inputs:
  eric.json:
    speaker: "Eric Rakestraw"
The old inputs: direct mapping format is no longer supported.
Output JSON Format
The merged output uses the current seriatim envelope:
{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "coalesce", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
    {
      "id": 1,
      "source": "eric.json",
      "source_segment_index": 0,
      "speaker": "Eric Rakestraw",
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "overlap_group_id": 1
    },
    {
      "id": 2,
      "source": "eric.json",
      "source_ref": "word-run:1:1:1",
      "derived_from": ["eric.json#0"],
      "speaker": "Eric Rakestraw",
      "start": 2.0,
      "end": 2.5,
      "text": "Resolved word run",
      "categories": ["backchannel"]
    }
  ],
  "overlap_groups": [
    {
      "id": 1,
      "start": 1.25,
      "end": 4.0,
      "segments": ["eric.json#0", "mike.json#0"],
      "speakers": ["Eric Rakestraw", "Mike Brown"],
      "class": "unknown",
      "resolution": "unresolved"
    }
  ]
}
Segments are sorted deterministically by:
(start, end, source, source_segment_index/source_ref, speaker)
Final segment IDs are assigned after sorting and start at 1.
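A minimal sketch of this ordering in Go follows. The Segment type and function names are hypothetical. How source_segment_index and source_ref compare against each other is not specified above, so the sketch assumes indexed segments sort before derived ones; it also combines sorting and ID assignment, which the pipeline runs as separate steps.

```go
package main

import (
	"fmt"
	"sort"
)

// Segment is a hypothetical subset of the merged segment fields.
type Segment struct {
	ID                 int
	Source             string
	SourceSegmentIndex *int   // nil for derived segments
	SourceRef          string // set for derived segments
	Speaker            string
	Start, End         float64
}

// less compares by (start, end, source, index or ref, speaker).
func less(a, b Segment) bool {
	if a.Start != b.Start {
		return a.Start < b.Start
	}
	if a.End != b.End {
		return a.End < b.End
	}
	if a.Source != b.Source {
		return a.Source < b.Source
	}
	ai, bi := a.SourceSegmentIndex, b.SourceSegmentIndex
	switch {
	case ai != nil && bi != nil && *ai != *bi:
		return *ai < *bi
	case ai != nil && bi == nil:
		return true // assumption: indexed segments before derived ones
	case ai == nil && bi != nil:
		return false
	case ai == nil && bi == nil && a.SourceRef != b.SourceRef:
		return a.SourceRef < b.SourceRef
	}
	return a.Speaker < b.Speaker
}

// sortAndNumber sorts in place, then assigns consecutive IDs starting at 1.
func sortAndNumber(segs []Segment) {
	sort.SliceStable(segs, func(i, j int) bool { return less(segs[i], segs[j]) })
	for i := range segs {
		segs[i].ID = i + 1
	}
}

func main() {
	zero := 0
	segs := []Segment{
		{Source: "mike.json", SourceSegmentIndex: &zero, Speaker: "Mike Brown", Start: 1.25, End: 3.5},
		{Source: "eric.json", SourceSegmentIndex: &zero, Speaker: "Eric Rakestraw", Start: 1.25, End: 3.5},
	}
	sortAndNumber(segs)
	for _, s := range segs {
		fmt.Println(s.ID, s.Source, s.Start, s.End)
	}
}
```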
Overlap Detection
The default postprocessing pipeline detects overlapping segment groups.
Overlap behavior:
- A strict timing overlap is required: next.start < current_group_end.
- Segments that only touch at a boundary are not grouped.
- Groups require at least two distinct speakers.
- Transitive overlaps are grouped together.
- Segments in detected groups receive overlap_group_id.
- overlap_groups[].segments contains stable references in source#source_segment_index format.
- class is currently unknown.
- resolution is unresolved until resolve-overlaps replaces the group.
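The grouping rules above can be sketched as a single sweep over segments already sorted by start time. The types and detectOverlaps are hypothetical names, and the handling of same-speaker segments inside a chain is an assumption of this sketch rather than documented behavior.

```go
package main

import "fmt"

// Segment is a hypothetical minimal segment for overlap detection.
type Segment struct {
	Speaker    string
	Start, End float64
}

// Group records an overlap window and the indexes of its member segments.
type Group struct {
	Start, End float64
	Members    []int
}

// detectOverlaps expects segs sorted by start. A segment joins the current
// chain when next.start < current_group_end (strict), which also captures
// transitive overlaps because the chain end grows as members are added.
// Chains with fewer than two distinct speakers are discarded.
func detectOverlaps(segs []Segment) []Group {
	var groups []Group
	flush := func(g Group) {
		speakers := map[string]bool{}
		for _, i := range g.Members {
			speakers[segs[i].Speaker] = true
		}
		if len(speakers) >= 2 {
			groups = append(groups, g)
		}
	}
	var cur *Group
	for i, s := range segs {
		if cur != nil && s.Start < cur.End {
			cur.Members = append(cur.Members, i)
			if s.End > cur.End {
				cur.End = s.End
			}
			continue
		}
		if cur != nil {
			flush(*cur)
		}
		cur = &Group{Start: s.Start, End: s.End, Members: []int{i}}
	}
	if cur != nil {
		flush(*cur)
	}
	return groups
}

func main() {
	segs := []Segment{
		{Speaker: "Eric Rakestraw", Start: 1.25, End: 3.5},
		{Speaker: "Mike Brown", Start: 3.0, End: 4.0},
		{Speaker: "Eric Rakestraw", Start: 4.0, End: 5.0}, // touches at 4.0 only
	}
	fmt.Printf("%+v\n", detectOverlaps(segs))
}
```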
Overlap Resolution
The default postprocessing pipeline runs detect-overlaps, then resolve-overlaps, then backchannel, then filler, then coalesce, then a second detect-overlaps pass.
For each detected overlap group, resolve-overlaps uses preserved WhisperX word timing to build smaller word-run replacement segments:
- Words are included when their interval intersects the overlap window: word.end > group.start && word.start < group.end.
- Untimed words are included in replacement text in original word order when nearby timed words create a replacement run.
- Untimed words do not affect replacement segment start/end times or word-run gap splitting.
- Words for the same speaker are merged into one run when the gap between adjacent words is no greater than SERIATIM_OVERLAP_WORD_RUN_GAP.
- The default word-run gap is 0.75 seconds.
- Set SERIATIM_OVERLAP_WORD_RUN_GAP to a positive number of seconds to override the default.
- Near-start replacement word runs are reordered so shorter segments come first when adjacent starts are within SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW.
- The default word-run reorder window is 0.4 seconds.
- Set SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW to a positive number of seconds to override the default.
- Replacement segment text is built by joining word text with single spaces.
- Replacement segments include source_ref and derived_from.
- Replacement segments omit source_segment_index because they are derived from one or more original segments.
- Resolved overlap groups are removed before the second detection pass.
- Replacement segments are left without overlap_group_id until the second detection pass annotates any remaining overlap.
- If a speaker has no usable word timing in a group, that speaker's original segment is kept.
- If no speakers in a group have usable word timing, the original group and annotations remain unchanged.
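The window test and gap-based run splitting can be sketched for a single speaker as follows. Word, Run, and wordRuns are hypothetical names; the sketch deliberately leaves out untimed words, the reorder window, and source_ref/derived_from bookkeeping, and it takes the gap as a parameter instead of reading SERIATIM_OVERLAP_WORD_RUN_GAP.

```go
package main

import "fmt"

// Word is a timed word belonging to one speaker (untimed words are omitted
// from this sketch).
type Word struct {
	Text       string
	Start, End float64
}

// Run is one replacement word run.
type Run struct {
	Start, End float64
	Text       string
}

// wordRuns keeps words whose interval intersects the overlap window and
// merges adjacent kept words into one run while the gap between them is
// no greater than maxGap. Text is joined with single spaces.
func wordRuns(words []Word, groupStart, groupEnd, maxGap float64) []Run {
	var runs []Run
	for _, w := range words {
		if !(w.End > groupStart && w.Start < groupEnd) {
			continue // outside the overlap window
		}
		if n := len(runs); n > 0 && w.Start-runs[n-1].End <= maxGap {
			runs[n-1].Text += " " + w.Text
			if w.End > runs[n-1].End {
				runs[n-1].End = w.End
			}
			continue
		}
		runs = append(runs, Run{Start: w.Start, End: w.End, Text: w.Text})
	}
	return runs
}

func main() {
	words := []Word{
		{Text: "Hello", Start: 1.25, End: 1.55},
		{Text: "there.", Start: 1.7, End: 2.0},
		{Text: "So", Start: 3.2, End: 3.4},
		{Text: "anyway", Start: 3.5, End: 3.9},
	}
	// 0.75 is the documented default word-run gap.
	for _, r := range wordRuns(words, 1.0, 4.0, 0.75) {
		fmt.Printf("%.2f-%.2f %q\n", r.Start, r.End, r.Text)
	}
}
```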
Backchannels
The default pipeline runs backchannel before coalesce. It tags short acknowledgement segments with:
"categories": ["backchannel"]
Backchannel matching is case-insensitive, trims surrounding whitespace, and requires a matching acknowledgement phrase, no more than three whitespace-delimited words, and duration no greater than 1.0 second.
Fillers
The default pipeline runs filler after backchannel and before coalesce. It tags short filler utterances with:
"categories": ["filler"]
Filler matching is case-insensitive and trims surrounding whitespace. The segment text must consist solely of filler tokens such as um, uh, er, erm, ah, eh, hmm, mm, or repeated combinations of those tokens. Matching segments must contain no more than three whitespace-delimited words and have duration no greater than 1.0 second.
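A rough sketch of the filler check described above, written in Go. The function name isFiller is hypothetical, the token set contains only the examples listed here, and stripping trailing punctuation from each word is an assumption of the sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// fillerTokens holds only the example tokens named in this README; the real
// list may be longer.
var fillerTokens = map[string]bool{
	"um": true, "uh": true, "er": true, "erm": true,
	"ah": true, "eh": true, "hmm": true, "mm": true,
}

// isFiller reports whether a segment looks like a short filler utterance:
// at most three whitespace-delimited words, every word a filler token, and
// duration no greater than 1.0 second.
func isFiller(text string, start, end float64) bool {
	words := strings.Fields(strings.ToLower(text)) // also trims outer whitespace
	if len(words) == 0 || len(words) > 3 || end-start > 1.0 {
		return false
	}
	for _, w := range words {
		// Assumption: punctuation attached to a token is ignored.
		if !fillerTokens[strings.Trim(w, ".,!?")] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isFiller(" Um, uh. ", 2.0, 2.5))
	fmt.Println(isFiller("um okay", 0, 0.5))
}
```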
Coalescing
The default pipeline runs coalesce before the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when next.start - current.end <= --coalesce-gap.
Coalesced segments use source_ref values such as coalesce:1, include derived_from, and omit source_segment_index.
Different-speaker backchannel and filler segments do not block coalescing of surrounding same-speaker segments. When same-speaker segments are coalesced, any backchannel or filler category from the merged inputs is dropped from the coalesced segment.
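One plausible reading of these coalescing rules is sketched below. The names are hypothetical, source_ref and derived_from bookkeeping is omitted, and leaving a skipped interjection in place after the merged segment is an assumption of the sketch.

```go
package main

import "fmt"

// Segment is a hypothetical minimal segment for coalescing.
type Segment struct {
	Speaker    string
	Text       string
	Start, End float64
	Categories []string
}

// isInterjection reports whether a segment is tagged backchannel or filler.
func isInterjection(s Segment) bool {
	for _, c := range s.Categories {
		if c == "backchannel" || c == "filler" {
			return true
		}
	}
	return false
}

// coalesce merges a segment into the most recent merge candidate when both
// share a speaker and next.start - current.end <= maxGap. Interjections from
// a different speaker are kept in the output but do not replace the
// candidate, so they do not block merging around them.
func coalesce(segs []Segment, maxGap float64) []Segment {
	var out []Segment
	last := -1 // index in out of the current merge candidate
	for _, s := range segs {
		if last >= 0 && out[last].Speaker == s.Speaker && s.Start-out[last].End <= maxGap {
			out[last].Text += " " + s.Text
			if s.End > out[last].End {
				out[last].End = s.End
			}
			out[last].Categories = nil // categories are dropped on merge
			continue
		}
		out = append(out, s)
		if last >= 0 && isInterjection(s) && out[last].Speaker != s.Speaker {
			continue
		}
		last = len(out) - 1
	}
	return out
}

func main() {
	segs := []Segment{
		{Speaker: "Eric Rakestraw", Text: "Hello there.", Start: 0, End: 2},
		{Speaker: "Mike Brown", Text: "Yeah.", Start: 2.1, End: 2.4, Categories: []string{"backchannel"}},
		{Speaker: "Eric Rakestraw", Text: "How are you?", Start: 3, End: 5},
	}
	// 3.0 is the documented default for --coalesce-gap.
	for _, s := range coalesce(segs, 3.0) {
		fmt.Printf("%s: %q (%.1f-%.1f)\n", s.Speaker, s.Text, s.Start, s.End)
	}
}
```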
Autocorrect
Autocorrect is included in the default postprocessing pipeline. If --autocorrect is omitted, the module leaves transcript text unchanged and records a skip event in the optional report.
Enable corrections by passing --autocorrect:
go run ./cmd/seriatim merge \
--input-file input.json \
--autocorrect autocorrect.yml \
--output-file merged.json
autocorrect.yml format:
autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"
  - target: "Mike Brown"
    match:
      - "Mike Pat"
Matching behavior:
- Matching is case-sensitive.
- Matches apply only to whole tokens, not substrings inside larger words.
- Punctuation and whitespace can surround a match.
- Multi-word and hyphenated matches are supported.
- Duplicate match strings are invalid, including duplicates across separate rules.
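Whole-token, case-sensitive replacement can be sketched as a boundary-checked scan. The names below are hypothetical, and two points are assumptions of this sketch rather than documented behavior: any character that is not a letter or digit (including a hyphen or apostrophe) counts as a token boundary, and rules are applied one after another in file order.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Rule is a hypothetical in-memory form of one autocorrect.yml rule.
type Rule struct {
	Target string
	Match  []string
}

func isWordRune(r rune) bool { return unicode.IsLetter(r) || unicode.IsDigit(r) }

// atBoundary reports whether byte offset i in text sits between tokens:
// the rune before and the rune after must not both extend a word.
func boundaryBefore(text string, i int) bool {
	if i == 0 {
		return true
	}
	r, _ := utf8.DecodeLastRuneInString(text[:i])
	return !isWordRune(r)
}

func boundaryAfter(text string, j int) bool {
	if j >= len(text) {
		return true
	}
	r, _ := utf8.DecodeRuneInString(text[j:])
	return !isWordRune(r)
}

// replaceWholeToken replaces case-sensitive occurrences of match that are
// not embedded inside a larger word.
func replaceWholeToken(text, match, target string) string {
	if match == "" {
		return text
	}
	var b strings.Builder
	for i := 0; i < len(text); {
		if strings.HasPrefix(text[i:], match) && boundaryBefore(text, i) && boundaryAfter(text, i+len(match)) {
			b.WriteString(target)
			i += len(match)
			continue
		}
		_, size := utf8.DecodeRuneInString(text[i:])
		b.WriteString(text[i : i+size])
		i += size
	}
	return b.String()
}

// applyRules applies every match string of every rule in order.
func applyRules(text string, rules []Rule) string {
	for _, r := range rules {
		for _, m := range r.Match {
			text = replaceWholeToken(text, m, r.Target)
		}
	}
	return text
}

func main() {
	rules := []Rule{
		{Target: "Hrank", Match: []string{"hrank", "Frank"}},
		{Target: "Mike Brown", Match: []string{"Mike Pat"}},
	}
	fmt.Println(applyRules("Frank and hrank met Mike Pat. Franklin stayed.", rules))
}
```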
Current Limitations
- Only JSON input is supported.
- Overlap resolution depends on WhisperX word timing; groups without usable word timing remain unresolved.
- Alternate output formats are not implemented yet; json is the only output module.