Eric Rakestraw e42a2326e8 Implemented an overlap detection module in the postprocessing chain

2026-04-26 20:39:49 -05:00

6.3 KiB

seriatim

seriatim merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.

The current implementation supports the merge command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using speakers.yml, sorts all segments by timestamp, assigns consecutive numeric id values, and writes a merged JSON artifact.

Usage

Run from source:

go run ./cmd/seriatim merge \
  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
  --input-file samples/raw/2026-04-19-Mike_Brown.json \
  --output-file merged.json

Optional report output:

go run ./cmd/seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --output-file merged.json \
  --report-file report.json

CLI

seriatim merge [flags]

Required flags for the default pipeline:

--input-file: input transcript JSON file. Repeat once per speaker/input file.
--output-file: merged transcript JSON output path.

Optional flags:

--report-file: write a JSON report with pipeline events.
--speakers: speaker map YAML file. When omitted, input file basenames are used as speaker labels.
--autocorrect: autocorrect rules file. When omitted, the default autocorrect module is a no-op.
--input-reader: input reader module. Default: json-files.
--output-modules: comma-separated output modules. Default: json.
--preprocessing-modules: comma-separated preprocessing modules. Default: validate-raw,normalize-speakers,trim-text.
--postprocessing-modules: comma-separated postprocessing modules. Default: detect-overlaps,resolve-overlaps,autocorrect,assign-ids,validate-output.

Input JSON Format

Each input file must be valid JSON with a top-level segments array. The current parser accepts the WhisperX segment subset needed for merging:

{
  "segments": [
    {
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there."
    }
  ]
}

Required segment fields:

start: number, must be >= 0.
end: number, must be >= start.
text: string.

Other WhisperX fields, including words and raw diarization speaker labels, are ignored for now.
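
The field rules above can be sketched in Go as follows. This is only an illustration under assumptions, not seriatim's actual parser: the names rawSegment, rawTranscript, and parseTranscript are hypothetical, and accepting an empty segments array is a guess. It relies on encoding/json ignoring unknown fields by default, which is one way the extra WhisperX fields could be dropped.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawSegment holds the three required fields. Pointers let us tell a
// missing field apart from a zero value. (Hypothetical type.)
type rawSegment struct {
	Start *float64 `json:"start"`
	End   *float64 `json:"end"`
	Text  *string  `json:"text"`
}

// rawTranscript is the top-level input object. (Hypothetical type.)
type rawTranscript struct {
	Segments []rawSegment `json:"segments"`
}

// parseTranscript decodes data and checks every segment against the
// documented rules: start >= 0, end >= start, text present.
func parseTranscript(data []byte) ([]rawSegment, error) {
	var t rawTranscript
	if err := json.Unmarshal(data, &t); err != nil {
		return nil, fmt.Errorf("invalid JSON: %w", err)
	}
	if t.Segments == nil {
		return nil, fmt.Errorf("missing top-level segments array")
	}
	for i, s := range t.Segments {
		switch {
		case s.Start == nil || s.End == nil || s.Text == nil:
			return nil, fmt.Errorf("segment %d: start, end, and text are required", i)
		case *s.Start < 0:
			return nil, fmt.Errorf("segment %d: start must be >= 0", i)
		case *s.End < *s.Start:
			return nil, fmt.Errorf("segment %d: end must be >= start", i)
		}
	}
	return t.Segments, nil
}

func main() {
	input := []byte(`{"segments":[{"start":1.25,"end":3.5,"text":"Hello there.","words":[]}]}`)
	segs, err := parseTranscript(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(segs), *segs[0].Text)
}
```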

Speaker Map Format

speakers.yml is optional. If --speakers is omitted, seriatim uses each input file basename as the segment speaker label.

When provided, speakers.yml maps input files to canonical speaker names using ordered substring rules:

match:
  - speaker: "Eric Rakestraw"
    match:
      - "Eric_Rakestraw"
      - "Eric"

  - speaker: "Mike Brown"
    match:
      - "Mike_Brown"
      - "mb"

For each --input-file, seriatim takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.

For example, this input:

samples/raw/2026-04-19-Eric_Rakestraw.json

matches this rule because the basename contains Eric_Rakestraw:

- speaker: "Eric Rakestraw"
  match:
    - "Eric_Rakestraw"

Important details:

Matching is against the input file basename, not the full path.
Matching is case-insensitive.
Rules are evaluated from first to last.
Each rule must have a non-empty speaker.
Each rule must have at least one non-empty match string.
Duplicate speaker names are invalid.
Every input file must match at least one rule or the command fails.
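
The matching rules above amount to a first-match-wins lookup. A minimal Go sketch, assuming rules are already loaded from YAML into a slice; speakerRule and resolveSpeaker are hypothetical names and not necessarily how seriatim structures this:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// speakerRule mirrors one entry under match: in speakers.yml. (Hypothetical type.)
type speakerRule struct {
	Speaker string
	Match   []string
}

// resolveSpeaker returns the speaker of the first rule with a substring
// that occurs in the basename of inputPath, compared case-insensitively.
func resolveSpeaker(rules []speakerRule, inputPath string) (string, error) {
	base := strings.ToLower(filepath.Base(inputPath))
	for _, r := range rules {
		for _, m := range r.Match {
			if m != "" && strings.Contains(base, strings.ToLower(m)) {
				return r.Speaker, nil // first match wins; later rules are skipped
			}
		}
	}
	return "", fmt.Errorf("no speaker rule matches %q", filepath.Base(inputPath))
}

func main() {
	rules := []speakerRule{
		{Speaker: "Eric Rakestraw", Match: []string{"Eric_Rakestraw", "Eric"}},
		{Speaker: "Mike Brown", Match: []string{"Mike_Brown", "mb"}},
	}
	for _, p := range []string{
		"samples/raw/2026-04-19-Eric_Rakestraw.json",
		"samples/raw/2026-04-19-MB.json",
		"samples/raw/unknown.json",
	} {
		speaker, err := resolveSpeaker(rules, p)
		fmt.Printf("%s -> %q (err: %v)\n", p, speaker, err)
	}
}
```

Because short substrings such as "mb" can match unintended basenames, placing the most specific rules first keeps the result predictable.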

Removed old format:

inputs:
  eric.json:
    speaker: "Eric Rakestraw"

The old inputs: direct mapping format is no longer supported.

Output JSON Format

The merged output uses the current seriatim envelope:

{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "autocorrect", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
    {
      "id": 1,
      "source": "eric.json",
      "source_segment_index": 0,
      "speaker": "Eric Rakestraw",
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "overlap_group_id": 1
    }
  ],
  "overlap_groups": [
    {
      "id": 1,
      "start": 1.25,
      "end": 4.0,
      "segments": ["eric.json#0", "mike.json#0"],
      "speakers": ["Eric Rakestraw", "Mike Brown"],
      "class": "unknown",
      "resolution": "unresolved"
    }
  ]
}
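
A consumer of the merged file could decode the envelope with types like the ones below. This is a sketch derived only from the example above: the Go type names are hypothetical, and treating overlap_group_id as optional (absent for segments outside any group) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Types modeled on the example envelope. (Hypothetical names.)
type metadata struct {
	Application           string   `json:"application"`
	Version               string   `json:"version"`
	InputReader           string   `json:"input_reader"`
	InputFiles            []string `json:"input_files"`
	PreprocessingModules  []string `json:"preprocessing_modules"`
	PostprocessingModules []string `json:"postprocessing_modules"`
	OutputModules         []string `json:"output_modules"`
}

type segment struct {
	ID                 int     `json:"id"`
	Source             string  `json:"source"`
	SourceSegmentIndex int     `json:"source_segment_index"`
	Speaker            string  `json:"speaker"`
	Start              float64 `json:"start"`
	End                float64 `json:"end"`
	Text               string  `json:"text"`
	// Assumption: the field may be absent, so nil means "not in a group".
	OverlapGroupID *int `json:"overlap_group_id"`
}

type overlapGroup struct {
	ID         int      `json:"id"`
	Start      float64  `json:"start"`
	End        float64  `json:"end"`
	Segments   []string `json:"segments"`
	Speakers   []string `json:"speakers"`
	Class      string   `json:"class"`
	Resolution string   `json:"resolution"`
}

type transcript struct {
	Metadata      metadata       `json:"metadata"`
	Segments      []segment      `json:"segments"`
	OverlapGroups []overlapGroup `json:"overlap_groups"`
}

// decodeTranscript parses a merged seriatim JSON document.
func decodeTranscript(data []byte) (transcript, error) {
	var t transcript
	err := json.Unmarshal(data, &t)
	return t, err
}

func main() {
	data := []byte(`{"metadata":{"application":"seriatim","version":"dev"},
		"segments":[{"id":1,"source":"eric.json","source_segment_index":0,
		"speaker":"Eric Rakestraw","start":1.25,"end":3.5,"text":"Hello there.","overlap_group_id":1}],
		"overlap_groups":[]}`)
	t, err := decodeTranscript(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range t.Segments {
		fmt.Printf("%d [%s] %s\n", s.ID, s.Speaker, s.Text)
	}
}
```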

Segments are sorted deterministically by:

(start, end, source, source_segment_index, speaker)

Final segment IDs are assigned after sorting and start at 1.
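
The sort key and ID assignment can be sketched in Go like this. The segment type and the helper names less and sortAndNumber are hypothetical; only the key order and the 1-based numbering come from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// segment carries only the fields needed for ordering. (Hypothetical type.)
type segment struct {
	ID                 int
	Source             string
	SourceSegmentIndex int
	Speaker            string
	Start, End         float64
}

// less compares two segments by
// (start, end, source, source_segment_index, speaker).
func less(a, b segment) bool {
	switch {
	case a.Start != b.Start:
		return a.Start < b.Start
	case a.End != b.End:
		return a.End < b.End
	case a.Source != b.Source:
		return a.Source < b.Source
	case a.SourceSegmentIndex != b.SourceSegmentIndex:
		return a.SourceSegmentIndex < b.SourceSegmentIndex
	default:
		return a.Speaker < b.Speaker
	}
}

// sortAndNumber sorts segs in place, then assigns IDs 1..n in sorted order.
func sortAndNumber(segs []segment) {
	sort.SliceStable(segs, func(i, j int) bool { return less(segs[i], segs[j]) })
	for i := range segs {
		segs[i].ID = i + 1
	}
}

func main() {
	segs := []segment{
		{Source: "mike.json", SourceSegmentIndex: 0, Speaker: "Mike Brown", Start: 1.0, End: 2.0},
		{Source: "eric.json", SourceSegmentIndex: 1, Speaker: "Eric Rakestraw", Start: 1.0, End: 2.0},
		{Source: "eric.json", SourceSegmentIndex: 0, Speaker: "Eric Rakestraw", Start: 0.5, End: 1.0},
	}
	sortAndNumber(segs)
	for _, s := range segs {
		fmt.Printf("%d %s#%d %.2f-%.2f\n", s.ID, s.Source, s.SourceSegmentIndex, s.Start, s.End)
	}
}
```

Including source and source_segment_index in the key means two segments with identical timestamps always land in the same order, regardless of the order in which input files were read.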

Overlap Detection

The default postprocessing pipeline detects overlapping segment groups.

Overlap behavior:

A strict timing overlap is required: next.start < current_group_end.
Segments that only touch at a boundary are not grouped.
Groups require at least two distinct speakers.
Transitive overlaps are grouped together.
Segments in detected groups receive overlap_group_id.
overlap_groups[].segments contains stable references in source#source_segment_index format.
class is currently always "unknown".
resolution is currently always "unresolved"; overlap resolution is still a no-op.
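
One way to read the behavior above is as a single sweep over segments already sorted by start time. The Go sketch below is an interpretation, not the module's code: detectOverlaps and its types are hypothetical, group IDs are assumed to be sequential from 1, and a value of 0 stands in for "no overlap_group_id".

```go
package main

import "fmt"

// segment carries the fields the sweep needs. (Hypothetical type.)
type segment struct {
	Source             string
	SourceSegmentIndex int
	Speaker            string
	Start, End         float64
	OverlapGroupID     int // 0 means "not in a group" in this sketch
}

type overlapGroup struct {
	ID         int
	Start, End float64
	Segments   []string // "source#source_segment_index" references
	Speakers   []string // distinct speakers in first-seen order
}

// detectOverlaps expects segs sorted by start time. It tags grouped
// segments in place and returns the detected groups.
func detectOverlaps(segs []segment) []overlapGroup {
	var groups []overlapGroup
	// flush emits segs[first..last] as a group if it spans >= 2 speakers.
	flush := func(first, last int, end float64) {
		seen := map[string]bool{}
		var speakers, refs []string
		for i := first; i <= last; i++ {
			if !seen[segs[i].Speaker] {
				seen[segs[i].Speaker] = true
				speakers = append(speakers, segs[i].Speaker)
			}
			refs = append(refs, fmt.Sprintf("%s#%d", segs[i].Source, segs[i].SourceSegmentIndex))
		}
		if len(speakers) < 2 {
			return
		}
		id := len(groups) + 1
		for i := first; i <= last; i++ {
			segs[i].OverlapGroupID = id
		}
		groups = append(groups, overlapGroup{ID: id, Start: segs[first].Start, End: end, Segments: refs, Speakers: speakers})
	}
	first, groupEnd := 0, 0.0
	for i, s := range segs {
		if i == 0 {
			groupEnd = s.End
			continue
		}
		if s.Start < groupEnd { // strict: touching boundaries do not overlap
			if s.End > groupEnd {
				groupEnd = s.End // extending the end makes grouping transitive
			}
			continue
		}
		flush(first, i-1, groupEnd)
		first, groupEnd = i, s.End
	}
	if len(segs) > 0 {
		flush(first, len(segs)-1, groupEnd)
	}
	return groups
}

func main() {
	segs := []segment{
		{Source: "eric.json", SourceSegmentIndex: 0, Speaker: "Eric Rakestraw", Start: 1.25, End: 3.5},
		{Source: "mike.json", SourceSegmentIndex: 0, Speaker: "Mike Brown", Start: 3.0, End: 4.0},
		{Source: "eric.json", SourceSegmentIndex: 1, Speaker: "Eric Rakestraw", Start: 4.0, End: 5.0},
	}
	for _, g := range detectOverlaps(segs) {
		fmt.Printf("group %d %.2f-%.2f %v %v\n", g.ID, g.Start, g.End, g.Segments, g.Speakers)
	}
}
```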

Autocorrect

Autocorrect is included in the default postprocessing pipeline. If --autocorrect is omitted, the module leaves transcript text unchanged and records a skip event in the optional report.

Enable corrections by passing --autocorrect:

go run ./cmd/seriatim merge \
  --input-file input.json \
  --autocorrect autocorrect.yml \
  --output-file merged.json

autocorrect.yml format:

autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"

  - target: "Mike Brown"
    match:
      - "Mike Pat"

Matching behavior:

Matching is case-sensitive.
Matches apply only to whole tokens, not substrings inside larger words.
Punctuation and whitespace can surround a match.
Multi-word and hyphenated matches are supported.
Duplicate match strings are invalid, including duplicates across separate rules.
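
The matching behavior above can be approximated with a plain substring scan plus a boundary check. The Go sketch below is an assumption-laden illustration, not the module's implementation: it treats any character that is not a letter or digit (including a hyphen) as a token boundary, applies rules one after another, and uses the hypothetical names rule, checkDuplicates, replaceToken, and applyRules.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// rule mirrors one entry under autocorrect: in autocorrect.yml. (Hypothetical type.)
type rule struct {
	Target string
	Match  []string
}

// checkDuplicates rejects a match string that appears twice, even across rules.
func checkDuplicates(rules []rule) error {
	seen := map[string]bool{}
	for _, r := range rules {
		for _, m := range r.Match {
			if seen[m] {
				return fmt.Errorf("duplicate match string %q", m)
			}
			seen[m] = true
		}
	}
	return nil
}

// Assumption: letters and digits are token characters; everything else is a boundary.
func isWordRune(r rune) bool { return unicode.IsLetter(r) || unicode.IsDigit(r) }

// replaceToken replaces case-sensitive, whole-token occurrences of match with target.
func replaceToken(text, match, target string) string {
	if match == "" {
		return text
	}
	var b strings.Builder
	pos, from := 0, 0 // pos: first byte not yet copied; from: where to search next
	for {
		rel := strings.Index(text[from:], match)
		if rel < 0 {
			break
		}
		start := from + rel
		end := start + len(match)
		before, _ := utf8.DecodeLastRuneInString(text[:start])
		after, _ := utf8.DecodeRuneInString(text[end:])
		if isWordRune(before) || isWordRune(after) {
			// The hit sits inside a larger word; move past its first rune.
			_, size := utf8.DecodeRuneInString(text[start:])
			from = start + size
			continue
		}
		b.WriteString(text[pos:start])
		b.WriteString(target)
		pos, from = end, end
	}
	b.WriteString(text[pos:])
	return b.String()
}

// applyRules runs every match string of every rule over text, in order.
func applyRules(rules []rule, text string) string {
	for _, r := range rules {
		for _, m := range r.Match {
			text = replaceToken(text, m, r.Target)
		}
	}
	return text
}

func main() {
	rules := []rule{
		{Target: "Hrank", Match: []string{"hrank", "Frank"}},
		{Target: "Mike Brown", Match: []string{"Mike Pat"}},
	}
	if err := checkDuplicates(rules); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(applyRules(rules, "Frank met Mike Pat and Franklin, not frank."))
}
```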

Current Limitations

Only JSON input is supported.
Word-level timing data is not preserved yet.
Overlap resolution is currently a no-op module.
Coalescing and alternate output formats are not implemented yet.
