# seriatim

`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.

The current implementation supports the `merge` command. It reads one or more input JSON files, maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, assigns consecutive numeric `id` values, and writes a merged JSON artifact.

## Usage

Run from source:

```sh
go run ./cmd/seriatim merge \
  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
  --input-file samples/raw/2026-04-19-Mike_Brown.json \
  --speakers samples/speakers.yml \
  --output-file merged.json
```

Optional report output:

```sh
go run ./cmd/seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --speakers speakers.yml \
  --output-file merged.json \
  --report-file report.json
```

## CLI

```text
seriatim merge [flags]
```

Required flags for the default pipeline:

- `--input-file`: input transcript JSON file. Repeat once per speaker/input file.
- `--speakers`: speaker map YAML file. Required because `normalize-speakers` is enabled by default.
- `--output-file`: merged transcript JSON output path.

Optional flags:

- `--report-file`: write a JSON report with pipeline events.
- `--input-reader`: input reader module. Default: `json-files`.
- `--output-modules`: comma-separated output modules. Default: `json`.
- `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,assign-ids,validate-output`.
- `--autocorrect`: autocorrect rules file. Required when the postprocessing `autocorrect` module is enabled.

## Input JSON Format

Each input file must be valid JSON with a top-level `segments` array.
The current parser accepts the WhisperX segment subset needed for merging:

```json
{
  "segments": [
    {
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there."
    }
  ]
}
```

Required segment fields:

- `start`: number, must be `>= 0`.
- `end`: number, must be `>= start`.
- `text`: string.

Other WhisperX fields, including `words` and raw diarization speaker labels, are ignored for now.

## Speaker Map Format

`speakers.yml` maps input files to canonical speaker names using ordered substring rules:

```yaml
match:
  - speaker: "Eric Rakestraw"
    match:
      - "Eric_Rakestraw"
      - "Eric"
  - speaker: "Mike Brown"
    match:
      - "Mike_Brown"
      - "mb"
```

For each `--input-file`, `seriatim` takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.

For example, this input:

```text
samples/raw/2026-04-19-Eric_Rakestraw.json
```

matches this rule because the basename contains `Eric_Rakestraw`:

```yaml
- speaker: "Eric Rakestraw"
  match:
    - "Eric_Rakestraw"
```

Important details:

- Matching is against the input file basename, not the full path.
- Matching is case-insensitive.
- Rules are evaluated from first to last.
- Each rule must have a non-empty `speaker`.
- Each rule must have at least one non-empty `match` string.
- Duplicate speaker names are invalid.
- Every input file must match at least one rule or the command fails.

Deprecated old format:

```yaml
inputs:
  eric.json:
    speaker: "Eric Rakestraw"
```

The old `inputs:` direct mapping format is no longer supported.
## Output JSON Format

The merged output uses the current seriatim envelope:

```json
{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
    {
      "id": 1,
      "source": "eric.json",
      "source_segment_index": 0,
      "speaker": "Eric Rakestraw",
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there."
    }
  ],
  "overlap_groups": []
}
```

Segments are sorted deterministically by:

```text
(start, end, source, source_segment_index, speaker)
```

Final segment IDs are assigned after sorting and start at `1`.

## Autocorrect

Autocorrect is an opt-in postprocessing module. It is not part of the default pipeline.

Enable it by adding `autocorrect` to `--postprocessing-modules` and passing `--autocorrect`:

```sh
go run ./cmd/seriatim merge \
  --input-file input.json \
  --speakers speakers.yml \
  --autocorrect autocorrect.yml \
  --postprocessing-modules detect-overlaps,resolve-overlaps,autocorrect,assign-ids,validate-output \
  --output-file merged.json
```

`autocorrect.yml` format:

```yaml
autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"
  - target: "Mike Brown"
    match:
      - "Mike Pat"
```

Matching behavior:

- Matching is case-sensitive.
- Matches apply only to whole tokens, not substrings inside larger words.
- Punctuation and whitespace can surround a match.
- Multi-word and hyphenated matches are supported.
- Duplicate match strings are invalid, including duplicates across separate rules.

## Current Limitations

- Only JSON input is supported.
- Word-level timing data is not preserved yet.
- Overlap detection and overlap resolution are currently no-op modules.
- Coalescing and alternate output formats are not implemented yet.
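
For readers who want to reproduce the ordering described under Output JSON Format, the sort key and ID assignment can be sketched as below. This is a minimal illustration under stated assumptions, not the project's code: the `Segment` struct and the `sortAndAssignIDs` function are hypothetical, and the struct mirrors only the documented output fields.

```go
package main

import (
	"fmt"
	"sort"
)

// Segment is a hypothetical struct mirroring the documented output fields.
type Segment struct {
	ID                 int
	Source             string
	SourceSegmentIndex int
	Speaker            string
	Start, End         float64
	Text               string
}

// sortAndAssignIDs orders segments by the documented key
// (start, end, source, source_segment_index, speaker) and then
// numbers them consecutively starting at 1.
func sortAndAssignIDs(segs []Segment) {
	sort.SliceStable(segs, func(i, j int) bool {
		a, b := segs[i], segs[j]
		if a.Start != b.Start {
			return a.Start < b.Start
		}
		if a.End != b.End {
			return a.End < b.End
		}
		if a.Source != b.Source {
			return a.Source < b.Source
		}
		if a.SourceSegmentIndex != b.SourceSegmentIndex {
			return a.SourceSegmentIndex < b.SourceSegmentIndex
		}
		return a.Speaker < b.Speaker
	})
	for i := range segs {
		segs[i].ID = i + 1 // IDs are assigned only after sorting
	}
}

func main() {
	segs := []Segment{
		{Source: "mike.json", SourceSegmentIndex: 0, Speaker: "Mike Brown", Start: 1.25, End: 3.5, Text: "Hi."},
		{Source: "eric.json", SourceSegmentIndex: 0, Speaker: "Eric Rakestraw", Start: 1.25, End: 3.5, Text: "Hello there."},
		{Source: "eric.json", SourceSegmentIndex: 1, Speaker: "Eric Rakestraw", Start: 0.5, End: 1.0, Text: "Okay."},
	}
	sortAndAssignIDs(segs)
	for _, s := range segs {
		fmt.Printf("%d %s %.2f-%.2f %q\n", s.ID, s.Source, s.Start, s.End, s.Text)
	}
}
```

Comparing every field of the key means that two segments with identical timestamps still land in the same order on every run, which keeps the assigned IDs stable.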