# seriatim

`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.

The current implementation supports the `merge` command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, assigns consecutive numeric `id` values, and writes a merged JSON artifact.

## Usage

Run from source:

```sh
go run ./cmd/seriatim merge \
  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
  --input-file samples/raw/2026-04-19-Mike_Brown.json \
  --output-file merged.json
```

Optional report output:

```sh
go run ./cmd/seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --output-file merged.json \
  --report-file report.json
```

## CLI

```text
seriatim merge [flags]
```

Required flags for the default pipeline:

- `--input-file`: input transcript JSON file. Repeat once per speaker/input file.
- `--output-file`: merged transcript JSON output path.

Optional flags:

- `--report-file`: write a JSON report with pipeline events.
- `--speakers`: speaker map YAML file. When omitted, input file basenames are used as speaker labels.
- `--autocorrect`: autocorrect rules file. When omitted, the default `autocorrect` module is a no-op.
- `--input-reader`: input reader module. Default: `json-files`.
- `--output-modules`: comma-separated output modules. Default: `json`.
- `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,autocorrect,assign-ids,validate-output`.

## Input JSON Format

Each input file must be valid JSON with a top-level `segments` array. The current parser accepts the WhisperX segment subset needed for merging:

```json
{
  "segments": [
    {
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there."
    }
  ]
}
```

Required segment fields:

- `start`: number, must be `>= 0`.
- `end`: number, must be `>= start`.
- `text`: string.

Other WhisperX fields, including `words` and raw diarization speaker labels, are ignored for now.
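
The checks above can be sketched in Go roughly as follows. This is an illustration only, not the seriatim source: the names `rawSegment`, `rawTranscript`, and `parseTranscript` are hypothetical, and rejecting a missing `segments` key while accepting an empty array is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawSegment holds the segment subset described above. Pointer fields let the
// validator tell a missing field apart from a zero value. (Hypothetical type.)
type rawSegment struct {
	Start *float64 `json:"start"`
	End   *float64 `json:"end"`
	Text  *string  `json:"text"`
}

// rawTranscript is the top-level input document. (Hypothetical type.)
type rawTranscript struct {
	Segments []rawSegment `json:"segments"`
}

// parseTranscript decodes one input document and applies the documented
// checks. encoding/json ignores unknown fields such as `words` by default.
func parseTranscript(data []byte) (*rawTranscript, error) {
	var t rawTranscript
	if err := json.Unmarshal(data, &t); err != nil {
		return nil, fmt.Errorf("invalid JSON: %w", err)
	}
	if t.Segments == nil {
		// Assumption: a missing or null `segments` key is an error.
		return nil, fmt.Errorf("missing top-level segments array")
	}
	for i, s := range t.Segments {
		switch {
		case s.Start == nil || s.End == nil || s.Text == nil:
			return nil, fmt.Errorf("segment %d: start, end, and text are required", i)
		case *s.Start < 0:
			return nil, fmt.Errorf("segment %d: start must be >= 0", i)
		case *s.End < *s.Start:
			return nil, fmt.Errorf("segment %d: end must be >= start", i)
		}
	}
	return &t, nil
}
```

Calling `parseTranscript` on the example document above returns one segment; a segment with `"start": 2, "end": 1` is rejected by the `end >= start` check.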

## Speaker Map Format

`speakers.yml` maps input files to canonical speaker names using ordered substring rules. The file is optional: if `--speakers` is omitted, `seriatim` uses each input file basename as the segment speaker label.

Example speaker map:

```yaml
match:
  - speaker: "Eric Rakestraw"
    match:
      - "Eric_Rakestraw"
      - "Eric"

  - speaker: "Mike Brown"
    match:
      - "Mike_Brown"
      - "mb"
```

For each `--input-file`, `seriatim` takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.

For example, this input:

```text
samples/raw/2026-04-19-Eric_Rakestraw.json
```

matches this rule because the basename contains `Eric_Rakestraw`:

```yaml
- speaker: "Eric Rakestraw"
  match:
    - "Eric_Rakestraw"
```

Important details:

- Matching is against the input file basename, not the full path.
- Matching is case-insensitive.
- Rules are evaluated from first to last.
- Each rule must have a non-empty `speaker`.
- Each rule must have at least one non-empty `match` string.
- Duplicate speaker names are invalid.
- When a speaker map is provided, every input file must match at least one rule, or the command fails.
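
The matching rules above amount to a first-match substring lookup. A minimal sketch in Go, assuming the rules have already been loaded and validated; `speakerRule` and `resolveSpeaker` are hypothetical names, and whether the real implementation includes the file extension in the basename is an assumption:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// speakerRule mirrors one entry of the speaker map. (Hypothetical type.)
type speakerRule struct {
	Speaker string
	Match   []string
}

// resolveSpeaker returns the speaker of the first rule with a match string
// contained in the input file basename, compared case-insensitively.
func resolveSpeaker(inputPath string, rules []speakerRule) (string, error) {
	// Only the basename takes part in matching, never the directory part.
	base := strings.ToLower(filepath.Base(inputPath))
	for _, rule := range rules {
		for _, m := range rule.Match {
			if strings.Contains(base, strings.ToLower(m)) {
				return rule.Speaker, nil // first match wins
			}
		}
	}
	return "", fmt.Errorf("no speaker rule matches %q", filepath.Base(inputPath))
}
```

With the example rules, `samples/raw/2026-04-19-Eric_Rakestraw.json` resolves to `Eric Rakestraw`, while `Eric/session.json` fails because only `session.json` is compared. Because matching is by substring, a short pattern such as `Eric` also matches a basename like `america.json` in this sketch, so more specific patterns are the safer choice.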

Removed legacy format:

```yaml
inputs:
  eric.json:
    speaker: "Eric Rakestraw"
```

The old `inputs:` direct mapping format is no longer supported.

## Output JSON Format

The merged output uses the current seriatim envelope:

```json
{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "autocorrect", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
    {
      "id": 1,
      "source": "eric.json",
      "source_segment_index": 0,
      "speaker": "Eric Rakestraw",
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "overlap_group_id": 1
    }
  ],
  "overlap_groups": [
    {
      "id": 1,
      "start": 1.25,
      "end": 4.0,
      "segments": ["eric.json#0", "mike.json#0"],
      "speakers": ["Eric Rakestraw", "Mike Brown"],
      "class": "unknown",
      "resolution": "unresolved"
    }
  ]
}
```

Segments are sorted deterministically by:

```text
(start, end, source, source_segment_index, speaker)
```

Final segment IDs are assigned after sorting and start at `1`.
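
The ordering and ID assignment can be sketched in Go as below. The `segment` type and `sortAndAssignIDs` function are hypothetical names chosen for illustration; only the sort key and the 1-based IDs come from the description above.

```go
package main

import "sort"

// segment carries the fields that take part in ordering. (Hypothetical type.)
type segment struct {
	ID                 int
	Source             string
	SourceSegmentIndex int
	Speaker            string
	Start, End         float64
	Text               string
}

// sortAndAssignIDs orders segments by
// (start, end, source, source_segment_index, speaker)
// and then numbers them consecutively starting at 1.
func sortAndAssignIDs(segs []segment) {
	sort.SliceStable(segs, func(i, j int) bool {
		a, b := segs[i], segs[j]
		if a.Start != b.Start {
			return a.Start < b.Start
		}
		if a.End != b.End {
			return a.End < b.End
		}
		if a.Source != b.Source {
			return a.Source < b.Source
		}
		if a.SourceSegmentIndex != b.SourceSegmentIndex {
			return a.SourceSegmentIndex < b.SourceSegmentIndex
		}
		return a.Speaker < b.Speaker
	})
	for i := range segs {
		segs[i].ID = i + 1
	}
}
```

Because every field of the key is compared, two runs over the same inputs produce the same order regardless of the order in which `--input-file` flags were given.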

## Overlap Detection

The default postprocessing pipeline detects overlapping segment groups.

Overlap behavior:

- A strict timing overlap is required: `next.start < current_group_end`.
- Segments that only touch at a boundary are not grouped.
- Groups require at least two distinct speakers.
- Transitive overlaps are grouped together.
- Segments in detected groups receive `overlap_group_id`.
- `overlap_groups[].segments` contains stable references in `source#source_segment_index` format.
- `class` is currently `unknown`.
- `resolution` is currently `unresolved`; overlap resolution is still a no-op.
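
One way to implement the grouping rules above is a single sweep over segments that are already sorted by start time. The Go sketch below is an assumption about the approach, not the actual `detect-overlaps` module; `segment`, `overlapGroup`, and `detectOverlaps` are hypothetical names, and an `OverlapGroupID` of `0` stands for "not in a group".

```go
package main

import "fmt"

// segment carries the fields needed for overlap detection. (Hypothetical type.)
type segment struct {
	Source             string
	SourceSegmentIndex int
	Speaker            string
	Start, End         float64
	OverlapGroupID     int // 0 means the segment is not in a group
}

// overlapGroup mirrors one entry of `overlap_groups`. (Hypothetical type.)
type overlapGroup struct {
	ID         int
	Start, End float64
	Segments   []string // "source#source_segment_index" references
	Speakers   []string
}

// detectOverlaps expects segs sorted by start time. It tags grouped segments
// in place and returns the detected groups.
func detectOverlaps(segs []segment) []overlapGroup {
	var groups []overlapGroup
	for first := 0; first < len(segs); {
		// Grow the candidate while the next segment starts strictly before
		// the running group end; this also captures transitive overlaps.
		end := segs[first].End
		last := first + 1
		for last < len(segs) && segs[last].Start < end {
			if segs[last].End > end {
				end = segs[last].End
			}
			last++
		}
		seen := map[string]bool{}
		var refs, speakers []string
		for _, s := range segs[first:last] {
			refs = append(refs, fmt.Sprintf("%s#%d", s.Source, s.SourceSegmentIndex))
			if !seen[s.Speaker] {
				seen[s.Speaker] = true
				speakers = append(speakers, s.Speaker)
			}
		}
		// Keep only candidates that involve at least two distinct speakers.
		if len(speakers) >= 2 {
			id := len(groups) + 1
			for i := first; i < last; i++ {
				segs[i].OverlapGroupID = id
			}
			groups = append(groups, overlapGroup{
				ID: id, Start: segs[first].Start, End: end,
				Segments: refs, Speakers: speakers,
			})
		}
		first = last
	}
	return groups
}
```

In this sketch, a segment that starts exactly at the current group end opens a new candidate, so segments that only touch at a boundary are never grouped.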

## Autocorrect

Autocorrect is included in the default postprocessing pipeline. If `--autocorrect` is omitted, the module leaves transcript text unchanged and records a skip event in the optional report.

Enable corrections by passing `--autocorrect`:

```sh
go run ./cmd/seriatim merge \
  --input-file input.json \
  --autocorrect autocorrect.yml \
  --output-file merged.json
```

`autocorrect.yml` format:

```yaml
autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"

  - target: "Mike Brown"
    match:
      - "Mike Pat"
```

Matching behavior:

- Matching is case-sensitive.
- Matches apply only to whole tokens, not substrings inside larger words.
- Punctuation and whitespace can surround a match.
- Multi-word and hyphenated matches are supported.
- Duplicate match strings are invalid, including duplicates across separate rules.
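
Whole-token, case-sensitive replacement can be sketched in Go as follows. This is an illustration under stated assumptions, not the seriatim module: `rule`, `replaceWholeToken`, and `autocorrect` are hypothetical names, a token boundary is assumed to be any position not adjacent to a letter or digit, and rules are assumed to be applied one after another in file order.

```go
package main

import (
	"strings"
	"unicode"
	"unicode/utf8"
)

// rule mirrors one entry of the autocorrect file. (Hypothetical type.)
type rule struct {
	Target string
	Match  []string
}

// isWordRune reports whether r counts as part of a token in this sketch.
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r)
}

// replaceWholeToken replaces case-sensitive occurrences of match that are not
// directly preceded or followed by a letter or digit.
func replaceWholeToken(text, match, target string) string {
	if match == "" {
		return text
	}
	var out strings.Builder
	pos := 0 // start of the text not yet copied to out
	for search := 0; ; {
		i := strings.Index(text[search:], match)
		if i < 0 {
			break
		}
		start := search + i
		end := start + len(match)
		// At the string edges these return utf8.RuneError, which is not a
		// word rune, so the edge counts as a boundary.
		before, _ := utf8.DecodeLastRuneInString(text[:start])
		after, _ := utf8.DecodeRuneInString(text[end:])
		if isWordRune(before) || isWordRune(after) {
			// Inside a larger word: step one rune forward and keep looking.
			_, size := utf8.DecodeRuneInString(text[start:])
			search = start + size
			continue
		}
		out.WriteString(text[pos:start])
		out.WriteString(target)
		pos, search = end, end
	}
	out.WriteString(text[pos:])
	return out.String()
}

// autocorrect applies every match string of every rule in order.
func autocorrect(text string, rules []rule) string {
	for _, r := range rules {
		for _, m := range r.Match {
			text = replaceWholeToken(text, m, r.Target)
		}
	}
	return text
}
```

With the example rules, `Frank met Mike Pat, then Franklin called hrank.` becomes `Hrank met Mike Brown, then Franklin called Hrank.`: `Franklin` is left alone because the match sits inside a larger word, and a lowercase `frank` would not match because comparison is case-sensitive.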

## Current Limitations

- Only JSON input is supported.
- Word-level timing data is not preserved yet.
- Overlap resolution is currently a no-op module.
- Coalescing and alternate output formats are not implemented yet.