# seriatim

`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.

The current implementation supports the `merge` and `trim` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset.

## Usage

Run from source:

```sh
go run ./cmd/seriatim merge \
  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
  --input-file samples/raw/2026-04-19-Mike_Brown.json \
  --output-file merged.json
```

Optional report output:

```sh
go run ./cmd/seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --output-file merged.json \
  --report-file report.json
```

Trim an existing seriatim artifact:

```sh
go run ./cmd/seriatim trim \
  --input-file merged.json \
  --output-file trimmed.json \
  --keep "1-10, 15, 20-25"
```

## CLI

```text
seriatim merge [flags]
seriatim trim [flags]
```

Global flags:

| Flag | Description |
| --- | --- |
| `--help` | Show command help. |
| `--version` | Show application version. Local builds default to `dev`; release builds inject the release version. |

`merge` flags:

| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| `--input-file` | Yes | none | Input transcript JSON file. Repeat once per speaker/input file. |
| `--output-file` | Yes | none | Merged transcript JSON output path. |
| `--report-file` | No | none | Optional report JSON output path. |
| `--speakers` | No | none | Speaker map YAML file. When omitted, input file basenames are used as speaker labels. |
| `--autocorrect` | No | none | Autocorrect rules YAML file. When omitted, the default `autocorrect` module leaves text unchanged. |
| `--input-reader` | No | `json-files` | Input reader module. |
| `--output-modules` | No | `json` | Comma-separated output modules. |
| `--output-schema` | No | `seriatim-intermediate` | JSON output contract. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. If omitted, the value of `SERIATIM_OUTPUT_SCHEMA` is used when set, otherwise the runtime default; consumers that depend on a specific shape should set this explicitly. |
| `--preprocessing-modules` | No | `validate-raw,normalize-speakers,trim-text` | Comma-separated preprocessing modules, evaluated in order. |
| `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
| `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |

`trim` flags:

| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| `--input-file` | Yes | none | Input seriatim output artifact JSON file. |
| `--output-file` | Yes | none | Trimmed transcript JSON output path. |
| `--keep` | Exactly one of `--keep` or `--remove` is required | none | Segment ID selector to retain. |
| `--remove` | Exactly one of `--keep` or `--remove` is required | none | Segment ID selector to drop. |
| `--output-schema` | No | preserve input artifact schema | Optional output schema override: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`. |
| `--report-file` | No | none | Optional report JSON output path. |
| `--allow-empty` | No | `false` | Allow trimming to zero retained segments. |

`trim` selection rules:

- Exactly one of `--keep` or `--remove` is required; the two flags are mutually exclusive.
- Selection is by segment ID only.
- Invalid selected segment IDs fail the command by default.

`trim` selector syntax:

- Segment IDs are positive 1-based integers.
- Inclusive ranges are supported: `1-10`.
- Comma-separated selectors are supported: `1-10,15,20-25`.
- Whitespace around numbers, commas, and hyphens is allowed: `1 - 10, 15, 20 - 25`.
- Duplicate and overlapping ranges are accepted and normalized as a union.
- Descending ranges (for example `10-1`) are rejected.
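
The selector rules above can be sketched in Go as follows. This is a minimal illustration, not seriatim's actual parser: `parseSelector`, `parseTerm`, and `parseID` are hypothetical names, and the error messages are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseSelector expands a selector such as "1-10, 15, 20-25" into a sorted,
// de-duplicated list of segment IDs (the union of all terms).
// Hypothetical helper; seriatim's real parser may differ.
func parseSelector(sel string) ([]int, error) {
	seen := map[int]bool{}
	for _, part := range strings.Split(sel, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty selector term")
		}
		lo, hi, err := parseTerm(part)
		if err != nil {
			return nil, err
		}
		for id := lo; id <= hi; id++ {
			seen[id] = true
		}
	}
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids, nil
}

// parseTerm parses a single ID ("15") or an inclusive range ("1 - 10").
func parseTerm(term string) (lo, hi int, err error) {
	left, right, isRange := strings.Cut(term, "-")
	lo, err = parseID(left)
	if err != nil {
		return 0, 0, err
	}
	if !isRange {
		return lo, lo, nil
	}
	hi, err = parseID(right)
	if err != nil {
		return 0, 0, err
	}
	if hi < lo {
		return 0, 0, fmt.Errorf("descending range %q", term)
	}
	return lo, hi, nil
}

// parseID parses a positive 1-based integer, tolerating surrounding whitespace.
func parseID(s string) (int, error) {
	id, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil || id < 1 {
		return 0, fmt.Errorf("invalid segment ID %q", s)
	}
	return id, nil
}

func main() {
	ids, err := parseSelector("1 - 3, 2, 5")
	fmt.Println(ids, err) // prints: [1 2 3 5] <nil>
}
```

Collecting IDs in a set before sorting is what makes duplicate and overlapping terms collapse into a union.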

`trim` behavior:

- `trim` consumes existing seriatim JSON output artifacts only.
- `trim` does not accept raw WhisperX transcript JSON as input.
- Retained output segment IDs are renumbered sequentially from `1` to `N`.
- Transcript order is preserved from input transcript order; selector order does not reorder output.
- When output schema is `seriatim-full`, overlap groups are recomputed from retained segments.
- `--output-schema seriatim-full` is supported when trim has full-schema artifact data to emit; trim does not synthesize missing full-schema provenance from minimal/intermediate input artifacts.
- `trim` does not run merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.

`trim` report output:

- When `--report-file` is provided, the report includes standard trim/validation/output events.
- The report includes a `trim-audit` event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
- Old-to-new ID mapping is emitted as a deterministic ordered array of `{old_id, new_id}` pairs.
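
As a rough sketch of how renumbering and the ordered ID mapping fit together, assuming a simplified `Segment` type and a hypothetical `trimKeep` helper (neither is taken from seriatim's code):

```go
package main

import "fmt"

// Segment is a simplified stand-in for a seriatim output segment.
type Segment struct {
	ID   int
	Text string
}

// IDPair mirrors the {old_id, new_id} pairs described for the trim-audit event.
type IDPair struct {
	OldID int `json:"old_id"`
	NewID int `json:"new_id"`
}

// trimKeep retains segments whose ID is in keep, preserves transcript order,
// and renumbers the retained segments from 1 to N.
func trimKeep(segments []Segment, keep map[int]bool) ([]Segment, []IDPair) {
	var out []Segment
	var mapping []IDPair
	for _, seg := range segments {
		if !keep[seg.ID] {
			continue
		}
		newID := len(out) + 1
		mapping = append(mapping, IDPair{OldID: seg.ID, NewID: newID})
		seg.ID = newID
		out = append(out, seg)
	}
	return out, mapping
}

func main() {
	segs := []Segment{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}}
	// Selector order (4 before 2) does not affect output order.
	out, mapping := trimKeep(segs, map[int]bool{4: true, 2: true})
	fmt.Println(out, mapping) // prints: [{1 b} {2 d}] [{2 1} {4 2}]
}
```

Because the loop walks the input transcript once, the mapping comes out in transcript order, which is one way to keep it deterministic.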

Environment variables:

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SERIATIM_OUTPUT_SCHEMA` | `seriatim-intermediate` | Output schema used when `--output-schema` is not explicitly provided. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. The CLI flag takes precedence. |
| `SERIATIM_OVERLAP_WORD_RUN_GAP` | `1.0` | Maximum gap in seconds between adjacent timed words when `resolve-overlaps` builds word-run replacement segments. Must be a positive float. |
| `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW` | `1.0` | Near-start window in seconds for ordering replacement word runs shortest-first. Must be a positive float. |
| `SERIATIM_BACKCHANNEL_MAX_DURATION` | `2.0` | Maximum duration in seconds for `backchannel` classification. Must be a positive float. |
| `SERIATIM_FILLER_MAX_DURATION` | `1.25` | Maximum duration in seconds for `filler` classification. Must be a positive float. |
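
The "default unless set, must be a positive float" pattern shared by the duration variables could look roughly like this. `positiveFloat` is a hypothetical helper, and treating an empty value as unset is an assumption of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// positiveFloat resolves a setting from lookup (for example os.LookupEnv),
// falling back to def when the variable is unset.
// Assumption: an empty value is treated the same as unset.
func positiveFloat(name string, lookup func(string) (string, bool), def float64) (float64, error) {
	raw, ok := lookup(name)
	if !ok || raw == "" {
		return def, nil
	}
	v, err := strconv.ParseFloat(raw, 64)
	// !(v > 0) also rejects NaN, which compares false against everything.
	if err != nil || !(v > 0) {
		return 0, fmt.Errorf("%s must be a positive float, got %q", name, raw)
	}
	return v, nil
}

func main() {
	gap, err := positiveFloat("SERIATIM_OVERLAP_WORD_RUN_GAP", os.LookupEnv, 1.0)
	fmt.Println(gap, err)
}
```

Passing the lookup function in makes the helper easy to exercise without touching the real process environment.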

## Input JSON Format

Each input file must be valid JSON with a top-level `segments` array. The current parser accepts the WhisperX segment subset needed for merging:

```json
{
  "segments": [
    {
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "words": [
        {"word": "Hello", "start": 1.25, "end": 1.55, "score": 0.98},
        {"word": "there.", "start": 1.7, "end": 2.0}
      ]
    }
  ]
}
```

Required segment fields:

- `start`: number, must be `>= 0`.
- `end`: number, must be `>= start`.
- `text`: string.

Optional word fields:

- `words`: array of word timing objects.
- `words[].word`: string.
- `words[].start`: optional number, must be `>= 0` when present.
- `words[].end`: optional number, must be `>= start` when present with `start`.
- `words[].score`: optional number.
- `words[].speaker`: optional raw speaker label string.

Word-level timing is preserved internally for overlap resolution. If a word is missing `start` or `end`, seriatim keeps the word text, emits a warning in the optional report, and does not use that word as a timing anchor. Word timing is not emitted in the final JSON artifact.
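
For readers consuming the same input shape from Go, the field rules above might be modeled like this. The type names and the `decode` helper are illustrative assumptions, not seriatim's internal types; pointer fields are used so that a missing `start` or `end` can be told apart from `0`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Word models one entry of segments[].words. Optional numbers are pointers
// so that "absent" is distinguishable from zero.
type Word struct {
	Word    string   `json:"word"`
	Start   *float64 `json:"start,omitempty"`
	End     *float64 `json:"end,omitempty"`
	Score   *float64 `json:"score,omitempty"`
	Speaker string   `json:"speaker,omitempty"`
}

// Timed reports whether the word has both timestamps and can anchor timing.
func (w Word) Timed() bool { return w.Start != nil && w.End != nil }

type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
	Words []Word  `json:"words,omitempty"`
}

type Transcript struct {
	Segments []Segment `json:"segments"`
}

// decode parses input JSON and applies the segment-level timing checks.
func decode(data []byte) (Transcript, error) {
	var tr Transcript
	if err := json.Unmarshal(data, &tr); err != nil {
		return tr, err
	}
	if tr.Segments == nil {
		return tr, fmt.Errorf("missing top-level segments array")
	}
	for i, s := range tr.Segments {
		if s.Start < 0 || s.End < s.Start {
			return tr, fmt.Errorf("segment %d: invalid start/end", i)
		}
	}
	return tr, nil
}

func main() {
	tr, err := decode([]byte(`{"segments":[{"start":1.25,"end":3.5,"text":"Hello there.",
		"words":[{"word":"Hello","start":1.25,"end":1.55},{"word":"there."}]}]}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, w := range tr.Segments[0].Words {
		fmt.Println(w.Word, w.Timed())
	}
}
```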

## Speaker Map Format

`speakers.yml` maps input files to canonical speaker names using ordered substring rules. The file is optional. If `--speakers` is omitted, `seriatim` uses each input file basename as the segment speaker label.

A speaker map looks like this:

```yaml
match:
  - speaker: "Eric Rakestraw"
    match:
      - "Eric_Rakestraw"
      - "Eric"

  - speaker: "Mike Brown"
    match:
      - "Mike_Brown"
      - "mb"
```

For each `--input-file`, `seriatim` takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.

For example, this input:

```text
samples/raw/2026-04-19-Eric_Rakestraw.json
```

matches this rule because the basename contains `Eric_Rakestraw`:

```yaml
- speaker: "Eric Rakestraw"
  match:
    - "Eric_Rakestraw"
```

Important details:

- Matching is against the input file basename, not the full path.
- Matching is case-insensitive.
- Rules are evaluated from first to last.
- Each rule must have a non-empty `speaker`.
- Each rule must have at least one non-empty `match` string.
- Duplicate speaker names are invalid.
- Every input file must match at least one rule or the command fails.
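
The first-match-wins lookup described above can be sketched as follows. `Rule` and `resolveSpeaker` are hypothetical names, and the sketch assumes the basename includes the file extension.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Rule is a simplified stand-in for one speaker-map entry.
type Rule struct {
	Speaker string
	Match   []string
}

// resolveSpeaker returns the speaker of the first rule with a substring that
// occurs in the input file's basename, compared case-insensitively.
func resolveSpeaker(rules []Rule, inputPath string) (string, error) {
	base := strings.ToLower(filepath.Base(inputPath))
	for _, r := range rules {
		for _, m := range r.Match {
			if strings.Contains(base, strings.ToLower(m)) {
				return r.Speaker, nil
			}
		}
	}
	return "", fmt.Errorf("no speaker rule matches %q", inputPath)
}

func main() {
	rules := []Rule{
		{Speaker: "Eric Rakestraw", Match: []string{"Eric_Rakestraw", "Eric"}},
		{Speaker: "Mike Brown", Match: []string{"Mike_Brown", "mb"}},
	}
	fmt.Println(resolveSpeaker(rules, "samples/raw/2026-04-19-Eric_Rakestraw.json"))
	// prints: Eric Rakestraw <nil>
}
```

Because evaluation stops at the first hit, short substrings such as `mb` are safest near the end of the rule list, where they cannot shadow more specific rules.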

Removed legacy format:

```yaml
inputs:
  eric.json:
    speaker: "Eric Rakestraw"
```

The old `inputs:` direct mapping format is no longer supported.

## Output JSON Format

`--output-modules json` controls the writer. `--output-schema` controls the JSON contract that writer serializes.

The named schemas are stable public contracts. If a consumer depends on a specific shape, it should request that schema explicitly at runtime. The runtime default selection may change in a future release.

The `seriatim-intermediate` schema is the current default selection when neither `--output-schema` nor `SERIATIM_OUTPUT_SCHEMA` is set. It stays close to the minimal schema, but adds optional `categories` on each segment:

```json
{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "output_schema": "seriatim-intermediate"
  },
  "segments": [
    {
      "id": 1,
      "start": 1.25,
      "end": 3.5,
      "speaker": "Eric Rakestraw",
      "text": "Hello there.",
      "categories": ["backchannel"]
    }
  ]
}
```

The `seriatim-full` schema uses the full seriatim envelope:

```json
{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "resolve-danglers", "coalesce", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
    {
      "id": 1,
      "source": "eric.json",
      "source_segment_index": 0,
      "speaker": "Eric Rakestraw",
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "overlap_group_id": 1
    },
    {
      "id": 2,
      "source": "eric.json",
      "source_ref": "word-run:1:1:1",
      "derived_from": ["eric.json#0"],
      "speaker": "Eric Rakestraw",
      "start": 2.0,
      "end": 2.5,
      "text": "Resolved word run",
      "categories": ["backchannel"]
    }
  ],
  "overlap_groups": [
    {
      "id": 1,
      "start": 1.25,
      "end": 4.0,
      "segments": ["eric.json#0", "mike.json#0"],
      "speakers": ["Eric Rakestraw", "Mike Brown"],
      "class": "unknown",
      "resolution": "unresolved"
    }
  ]
}
```

The `seriatim-minimal` schema emits minimal metadata and compact ordered segments:

```json
{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "output_schema": "seriatim-minimal"
  },
  "segments": [
    {
      "id": 1,
      "start": 1.25,
      "end": 3.5,
      "speaker": "Eric Rakestraw",
      "text": "Hello there."
    }
  ]
}
```

Minimal output intentionally omits categories, overlap groups, source/provenance fields, and pipeline configuration metadata.

Intermediate output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.

Segments are sorted deterministically by:

```text
(start, end, source, source_segment_index/source_ref, speaker)
```

Final segment IDs are assigned after sorting and start at `1`.
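
A comparator for that sort key, followed by ID assignment, might look like the sketch below. The `Segment` fields and `sortAndNumber` are illustrative. How the fourth key orders `source_segment_index` against `source_ref` is not specified here, so comparing `source_ref` first and the index second is purely an assumption of this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Segment carries only the fields that participate in ordering.
type Segment struct {
	ID          int
	Start, End  float64
	Source      string
	SourceIndex int    // stands in for source_segment_index
	SourceRef   string // stands in for source_ref on derived segments
	Speaker     string
}

// less compares two segments key by key, left to right.
func less(a, b Segment) bool {
	if a.Start != b.Start {
		return a.Start < b.Start
	}
	if a.End != b.End {
		return a.End < b.End
	}
	if a.Source != b.Source {
		return a.Source < b.Source
	}
	// Assumption: ref before index; the real tie-break may differ.
	if a.SourceRef != b.SourceRef {
		return a.SourceRef < b.SourceRef
	}
	if a.SourceIndex != b.SourceIndex {
		return a.SourceIndex < b.SourceIndex
	}
	return a.Speaker < b.Speaker
}

// sortAndNumber orders segments and assigns IDs 1..N in the sorted order.
func sortAndNumber(segs []Segment) {
	sort.SliceStable(segs, func(i, j int) bool { return less(segs[i], segs[j]) })
	for i := range segs {
		segs[i].ID = i + 1
	}
}

func main() {
	segs := []Segment{
		{Start: 2, End: 3, Source: "mike.json", Speaker: "Mike Brown"},
		{Start: 1, End: 4, Source: "mike.json", Speaker: "Mike Brown"},
		{Start: 1, End: 4, Source: "eric.json", Speaker: "Eric Rakestraw"},
	}
	sortAndNumber(segs)
	for _, s := range segs {
		fmt.Println(s.ID, s.Start, s.Source)
	}
}
```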

The public Go output contract is available from:

```go
import "gitea.maximumdirect.net/eric/seriatim/schema"
```

The same package embeds machine-readable JSON Schemas in `schema/full-output.schema.json`, `schema/intermediate-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.

## Overlap Detection

The default postprocessing pipeline detects overlapping segment groups.

Overlap behavior:

- A strict timing overlap is required: `next.start < current_group_end`.
- Segments that only touch at a boundary are not grouped.
- Groups require at least two distinct speakers.
- Transitive overlaps are grouped together.
- Segments in detected groups receive `overlap_group_id`.
- `overlap_groups[].segments` contains stable references in `source#source_segment_index` format.
- `class` is currently `unknown`.
- `resolution` is `unresolved` until `resolve-overlaps` replaces the group.
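
The grouping rules can be illustrated with a single sweep over segments that are already sorted by start time. `detectOverlaps` is a hypothetical helper, and it assumes that same-speaker segments can extend a chain even though a chain is reported only when it contains at least two distinct speakers.

```go
package main

import "fmt"

type Segment struct {
	Speaker    string
	Start, End float64
}

// detectOverlaps returns groups of segment indices. segs must be sorted by
// start time. A segment joins the current group only on strict overlap
// (start < group end), so segments that merely touch are not grouped.
func detectOverlaps(segs []Segment) [][]int {
	var groups [][]int
	// flush records a finished run if it spans at least two distinct speakers.
	flush := func(run []int) {
		speakers := map[string]bool{}
		for _, i := range run {
			speakers[segs[i].Speaker] = true
		}
		if len(speakers) >= 2 {
			groups = append(groups, run)
		}
	}
	var run []int
	groupEnd := 0.0
	for i, s := range segs {
		if len(run) > 0 && s.Start < groupEnd {
			run = append(run, i)
			if s.End > groupEnd {
				groupEnd = s.End // transitive overlaps extend the group
			}
			continue
		}
		flush(run)
		run = []int{i}
		groupEnd = s.End
	}
	flush(run)
	return groups
}

func main() {
	segs := []Segment{
		{"Eric", 0, 2}, {"Mike", 1, 3}, // strict overlap, two speakers
		{"Eric", 3, 4}, // touches the previous group end only
	}
	fmt.Println(detectOverlaps(segs)) // prints: [[0 1]]
}
```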

## Overlap Resolution

The default postprocessing pipeline runs `detect-overlaps`, then `resolve-overlaps`, then `backchannel`, then `filler`, then `resolve-danglers`, then `coalesce`, then a second `detect-overlaps` pass.

For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word timing to build smaller word-run replacement segments:

- The resolution window expands the detected overlap group by `--coalesce-gap` seconds on both sides.
- Nearby same-speaker context segments are included when they intersect the expanded window and their start or end is within `--coalesce-gap` of the original overlap boundary.
- Once a segment is selected for replacement, all timed words from that segment participate in word-run construction; the window controls segment selection, not per-word clipping.
- Context segments that are part of another detected overlap group are not pulled into the current group.
- Untimed words are included in replacement text in original word order when nearby timed words create a replacement run.
- Untimed words do not affect replacement segment start/end times or word-run gap splitting.
- Words for the same speaker are merged into one run when the gap between adjacent words is no greater than `SERIATIM_OVERLAP_WORD_RUN_GAP`.
- The default word-run gap is `1.0` seconds.
- Set `SERIATIM_OVERLAP_WORD_RUN_GAP` to a positive number of seconds to override the default.
- Near-start replacement word runs are reordered so shorter segments come first when adjacent starts are within `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW`.
- The default word-run reorder window is `1.0` seconds.
- Set `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW` to a positive number of seconds to override the default.
- Replacement segment text is built by joining word text with single spaces.
- Replacement segments include `source_ref` and `derived_from`.
- Replacement segments omit `source_segment_index` because they are derived from one or more original segments.
- Resolved overlap groups are removed before the second detection pass.
- Replacement segments are left without `overlap_group_id` until the second detection pass annotates any remaining overlap.
- If a speaker has no usable word timing in a group, that speaker's original segment is kept.
- If no speakers in a group have usable word timing, the original group and annotations remain unchanged.
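
The gap-based word-run rule for a single speaker can be sketched like this. `Word`, `Run`, and `buildRuns` are hypothetical, and the sketch covers only run splitting and text joining; window selection, reordering, and provenance fields are left out. Attaching an untimed word to the preceding run (or to the next run when none exists yet) is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

type Word struct {
	Text       string
	Start, End float64
	Timed      bool // false when start or end was missing in the input
}

type Run struct {
	Start, End float64
	Text       string
}

// buildRuns splits one speaker's words into runs wherever the gap between
// adjacent timed words exceeds maxGap. Untimed words contribute text only.
func buildRuns(words []Word, maxGap float64) []Run {
	var runs []Run
	var pending []string // untimed words seen before the first timed word
	for _, w := range words {
		if !w.Timed {
			if n := len(runs); n > 0 {
				runs[n-1].Text += " " + w.Text
			} else {
				pending = append(pending, w.Text)
			}
			continue
		}
		if n := len(runs); n > 0 && w.Start-runs[n-1].End <= maxGap {
			runs[n-1].Text += " " + w.Text
			if w.End > runs[n-1].End {
				runs[n-1].End = w.End
			}
			continue
		}
		text := strings.Join(append(pending, w.Text), " ")
		pending = nil
		runs = append(runs, Run{Start: w.Start, End: w.End, Text: text})
	}
	return runs
}

func main() {
	words := []Word{
		{"Hello", 0, 0.3, true},
		{"there", 0.4, 0.7, true},
		{"um", 0, 0, false},
		{"okay", 2.5, 2.8, true},
	}
	for _, r := range buildRuns(words, 1.0) {
		fmt.Printf("%.2f-%.2f %q\n", r.Start, r.End, r.Text)
	}
}
```

If no word is timed, the function returns no runs, which lines up with keeping the speaker's original segment in that case.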

## Backchannels

The default pipeline runs `backchannel` before `coalesce`. It tags short acknowledgement segments with:

```json
"categories": ["backchannel"]
```

Backchannel matching is case-insensitive, ignores punctuation for matching and word-count purposes, and trims surrounding whitespace. A segment is tagged only when it matches an acknowledgement phrase, contains no more than three whitespace-delimited words, and lasts no longer than `SERIATIM_BACKCHANNEL_MAX_DURATION` seconds. The default maximum duration is `2.0` seconds.

## Fillers

The default pipeline runs `filler` after `backchannel` and before `coalesce`. It tags short filler utterances with:

```json
"categories": ["filler"]
```

Filler matching is case-insensitive, ignores punctuation for matching and word-count purposes, and trims surrounding whitespace. A segment is tagged only when it consists entirely of filler tokens such as `um`, `uh`, `er`, `erm`, `ah`, `eh`, `hmm`, `mm`, or repeated combinations of those tokens, contains no more than three whitespace-delimited words, and lasts no longer than `SERIATIM_FILLER_MAX_DURATION` seconds. The default maximum duration is `1.25` seconds.
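
A filler check along these lines might be written as below. `isFiller` is a hypothetical function, the token set is limited to the tokens listed above (the real list may be longer), and removing punctuation outright before splitting into words is an assumption about how "ignores punctuation" is applied.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// fillerTokens holds only the tokens named in this README.
var fillerTokens = map[string]bool{
	"um": true, "uh": true, "er": true, "erm": true,
	"ah": true, "eh": true, "hmm": true, "mm": true,
}

// isFiller reports whether a segment qualifies as a filler utterance:
// short enough, at most three words, and every word a filler token.
func isFiller(text string, start, end, maxDuration float64) bool {
	if end-start > maxDuration {
		return false
	}
	// Lowercase and drop punctuation before counting words.
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) {
			return -1
		}
		return unicode.ToLower(r)
	}, text)
	words := strings.Fields(cleaned)
	if len(words) == 0 || len(words) > 3 {
		return false
	}
	for _, w := range words {
		if !fillerTokens[w] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isFiller(" Um, uh. ", 10.0, 11.0, 1.25)) // prints: true
	fmt.Println(isFiller("Um, well", 10.0, 11.0, 1.25))  // prints: false
}
```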

## Dangler Resolution

The default pipeline runs `resolve-danglers` before `coalesce` and before the second overlap detection pass. It repairs short derived fragments when they share provenance with a nearby segment:

- Dangling-end fragments have no more than two words and end in punctuation.
- Dangling-start fragments have no more than two words.
- Matching uses same-speaker segments with any shared `derived_from` value.
- Merged segments use `source_ref` values such as `resolve-danglers:1`, keep the target segment's transcript position, and union `derived_from`.

## Coalescing

The default pipeline runs `coalesce` after `resolve-danglers` and before the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.

Coalesced segments use `source_ref` values such as `coalesce:1`, include `derived_from`, and omit `source_segment_index`.

Different-speaker backchannel and filler segments do not block coalescing of surrounding same-speaker segments. Same-speaker backchannel and filler segments are merged normally when they are within `--coalesce-gap`. When same-speaker segments are coalesced, any `backchannel` or `filler` category from the merged inputs is dropped from the coalesced segment.
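
The basic adjacency rule can be sketched as a single pass. This simplified `coalesce` is hypothetical: it joins text with a single space (an assumption), clears categories on merge, and deliberately leaves out the handling that lets different-speaker backchannel and filler segments sit between merged segments, as well as `source_ref` and `derived_from` bookkeeping.

```go
package main

import "fmt"

type Segment struct {
	Speaker    string
	Start, End float64
	Text       string
	Categories []string
}

// coalesce merges each segment into the previous output segment when both
// share a speaker and next.start - current.end <= gap.
func coalesce(segs []Segment, gap float64) []Segment {
	var out []Segment
	for _, s := range segs {
		if n := len(out); n > 0 && out[n-1].Speaker == s.Speaker && s.Start-out[n-1].End <= gap {
			prev := &out[n-1]
			prev.Text += " " + s.Text
			if s.End > prev.End {
				prev.End = s.End
			}
			// Simplification: assumes backchannel/filler are the only
			// categories, so the merged segment carries none.
			prev.Categories = nil
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	segs := []Segment{
		{Speaker: "Eric", Start: 0, End: 1, Text: "So", Categories: []string{"filler"}},
		{Speaker: "Eric", Start: 2, End: 3, Text: "anyway"},
		{Speaker: "Mike", Start: 3.5, End: 4, Text: "Right"},
	}
	for _, s := range coalesce(segs, 3.0) {
		fmt.Printf("%s %.1f-%.1f %q\n", s.Speaker, s.Start, s.End, s.Text)
	}
}
```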

## Autocorrect

Autocorrect is included in the default postprocessing pipeline. If `--autocorrect` is omitted, the module leaves transcript text unchanged and records a skip event in the optional report.

Enable corrections by passing `--autocorrect`:

```sh
go run ./cmd/seriatim merge \
  --input-file input.json \
  --autocorrect autocorrect.yml \
  --output-file merged.json
```

`autocorrect.yml` format:

```yaml
autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"

  - target: "Mike Brown"
    match:
      - "Mike Pat"
```

Matching behavior:

- Matching is case-sensitive.
- Matches apply only to whole tokens, not substrings inside larger words.
- Punctuation and whitespace can surround a match.
- Multi-word and hyphenated matches are supported.
- Duplicate match strings are invalid, including duplicates across separate rules.
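
Whole-token, case-sensitive replacement can be approximated with a boundary check on either side of each occurrence. `replaceWholeToken` is a hypothetical helper, and treating only letters and digits as token characters is an assumption; the actual tokenizer may draw boundaries differently, for example around apostrophes or hyphens.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isTokenRune reports whether r counts as part of a word (assumption:
// letters and digits only).
func isTokenRune(r rune) bool { return unicode.IsLetter(r) || unicode.IsDigit(r) }

// wholeTokenAt reports whether text[start:end] is not glued to a token
// character on either side.
func wholeTokenAt(text string, start, end int) bool {
	if start > 0 {
		if r, _ := utf8.DecodeLastRuneInString(text[:start]); isTokenRune(r) {
			return false
		}
	}
	if end < len(text) {
		if r, _ := utf8.DecodeRuneInString(text[end:]); isTokenRune(r) {
			return false
		}
	}
	return true
}

// replaceWholeToken replaces case-sensitive, whole-token occurrences of
// match with target and leaves occurrences inside larger words untouched.
func replaceWholeToken(text, match, target string) string {
	if match == "" {
		return text
	}
	var b strings.Builder
	for i := 0; i < len(text); {
		j := strings.Index(text[i:], match)
		if j < 0 {
			b.WriteString(text[i:])
			break
		}
		start, end := i+j, i+j+len(match)
		b.WriteString(text[i:start])
		if wholeTokenAt(text, start, end) {
			b.WriteString(target)
			i = end
		} else {
			// Not a whole token: copy one byte and keep scanning.
			b.WriteByte(text[start])
			i = start + 1
		}
	}
	return b.String()
}

func main() {
	fmt.Println(replaceWholeToken("Frank met Frankie. frank, Frank!", "Frank", "Hrank"))
	// prints: Hrank met Frankie. frank, Hrank!
}
```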

## Current Limitations

- Only JSON input is supported.
- Overlap resolution depends on WhisperX word timing; groups without usable word timing remain unresolved.
- Alternate output formats are not implemented yet.

## Release Builds

Local builds record version metadata as `dev`. Release builds should inject the release version with `-ldflags`:

```sh
go build -ldflags "-X gitea.maximumdirect.net/eric/seriatim/internal/buildinfo.Version=v1.0.0" ./cmd/seriatim
```