seriatim merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.

The current implementation supports the merge and trim commands. merge reads one or more input JSON files, optionally maps each input file to a canonical speaker using speakers.yml, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric id values, and writes a merged JSON artifact. trim reads an existing seriatim output artifact and projects it to a retained segment subset.

Usage

Run from source:

go run ./cmd/seriatim merge \
  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
  --input-file samples/raw/2026-04-19-Mike_Brown.json \
  --output-file merged.json

Optional report output:

go run ./cmd/seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --output-file merged.json \
  --report-file report.json

Trim an existing seriatim artifact:

go run ./cmd/seriatim trim \
  --input-file merged.json \
  --output-file trimmed.json \
  --keep "1-10, 15, 20-25"

CLI

seriatim merge [flags]
seriatim trim [flags]

Global flags:

Flag	Description
`--help`	Show command help.
`--version`	Show application version. Local builds default to `dev`; release builds inject the release version.

merge flags:

Flag	Required	Default	Description
`--input-file`	Yes	none	Input transcript JSON file. Repeat once per speaker/input file.
`--output-file`	Yes	none	Merged transcript JSON output path.
`--report-file`	No	none	Optional report JSON output path.
`--speakers`	No	none	Speaker map YAML file. When omitted, input file basenames are used as speaker labels.
`--autocorrect`	No	none	Autocorrect rules YAML file. When omitted, the default `autocorrect` module leaves text unchanged.
`--input-reader`	No	`json-files`	Input reader module.
`--output-modules`	No	`json`	Comma-separated output modules.
`--output-schema`	No	`seriatim-intermediate`	JSON output contract. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. If omitted, `SERIATIM_OUTPUT_SCHEMA` is used when set, otherwise `seriatim-intermediate`; consumers that depend on a specific shape should set this explicitly.
`--preprocessing-modules`	No	`validate-raw,normalize-speakers,trim-text`	Comma-separated preprocessing modules, evaluated in order.
`--postprocessing-modules`	No	`detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output`	Comma-separated postprocessing modules, evaluated in order.
`--coalesce-gap`	No	`3.0`	Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float.

trim flags:

Flag	Required	Default	Description
`--input-file`	Yes	none	Input seriatim output artifact JSON file.
`--output-file`	Yes	none	Trimmed transcript JSON output path.
`--keep`	Exactly one of `--keep` or `--remove` is required	none	Segment ID selector to retain.
`--remove`	Exactly one of `--keep` or `--remove` is required	none	Segment ID selector to drop.
`--output-schema`	No	preserve input artifact schema	Optional output schema override: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`.
`--report-file`	No	none	Optional report JSON output path.
`--allow-empty`	No	`false`	Allow trimming to zero retained segments.

trim selection rules:

--keep and --remove are mutually exclusive.
Exactly one of --keep or --remove is required.
Selection is by segment ID only.
Invalid selected segment IDs fail the command by default.

trim selector syntax:

Segment IDs are positive 1-based integers.
Inclusive ranges are supported: 1-10.
Comma-separated selectors are supported: 1-10,15,20-25.
Whitespace around numbers, commas, and hyphens is allowed: 1 - 10, 15, 20 - 25.
Duplicate and overlapping ranges are accepted and normalized as a union.
Descending ranges (for example 10-1) are rejected.
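
The selector rules above can be sketched in Go as follows. This is a minimal illustration, not seriatim's actual parser: parseSelector and parseID are hypothetical names, and the error messages are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseSelector expands a selector such as "1-10, 15, 20 - 25" into a sorted,
// de-duplicated list of segment IDs (the union of all elements).
func parseSelector(s string) ([]int, error) {
	seen := map[int]bool{}
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty selector element")
		}
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := parseID(lo)
		if err != nil {
			return nil, err
		}
		last := first
		if isRange {
			if last, err = parseID(hi); err != nil {
				return nil, err
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for id := first; id <= last; id++ {
			seen[id] = true
		}
	}
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids, nil
}

// parseID accepts a positive 1-based integer with optional surrounding whitespace.
func parseID(s string) (int, error) {
	id, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil || id < 1 {
		return 0, fmt.Errorf("invalid segment ID %q", s)
	}
	return id, nil
}

func main() {
	ids, err := parseSelector("1 - 3, 2, 7")
	fmt.Println(ids, err) // [1 2 3 7] <nil>
}
```

Collecting IDs in a set before sorting is one simple way to get the union semantics for duplicate and overlapping ranges.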

trim behavior:

trim consumes existing seriatim JSON output artifacts only.
trim does not accept raw WhisperX transcript JSON as input.
Retained output segment IDs are renumbered sequentially from 1 to N.
Transcript order is preserved from input transcript order; selector order does not reorder output.
When output schema is seriatim-full, overlap groups are recomputed from retained segments.
--output-schema seriatim-full is supported when trim has full-schema artifact data to emit; trim does not synthesize missing full-schema provenance from minimal/intermediate input artifacts.
trim does not run merge postprocessors such as resolve-overlaps, coalesce, or autocorrect.

trim report output:

When --report-file is provided, the report includes standard trim/validation/output events.
The report includes a trim-audit event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
Old-to-new ID mapping is emitted as a deterministic ordered array of {old_id, new_id} pairs.
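
The renumbering and ID mapping described above might look roughly like this in Go. The Segment and IDPair types and the keepSegments function are assumptions made for the sketch; only the old_id and new_id field names come from the report description. Validation of unknown selected IDs is left out.

```go
package main

import "fmt"

// Segment is a reduced, hypothetical stand-in for an output segment.
type Segment struct {
	ID   int
	Text string
}

// IDPair mirrors one {old_id, new_id} entry in the trim-audit mapping.
type IDPair struct {
	OldID int `json:"old_id"`
	NewID int `json:"new_id"`
}

// keepSegments retains segments whose ID is in keep, preserves input order,
// renumbers the survivors 1..N, and records the old-to-new mapping in order.
func keepSegments(in []Segment, keep map[int]bool) ([]Segment, []IDPair) {
	var out []Segment
	var mapping []IDPair
	for _, seg := range in {
		if !keep[seg.ID] {
			continue
		}
		newID := len(out) + 1
		mapping = append(mapping, IDPair{OldID: seg.ID, NewID: newID})
		seg.ID = newID
		out = append(out, seg)
	}
	return out, mapping
}

func main() {
	in := []Segment{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}}
	out, mapping := keepSegments(in, map[int]bool{4: true, 2: true})
	fmt.Printf("%+v\n%+v\n", out, mapping)
}
```

Because the loop walks the input transcript rather than the selector, the order in which IDs were selected has no effect on output order.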

Environment variables:

Environment Variable	Default	Description
`SERIATIM_OUTPUT_SCHEMA`	`seriatim-intermediate`	Output schema used when `--output-schema` is not explicitly provided. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. The CLI flag takes precedence.
`SERIATIM_OVERLAP_WORD_RUN_GAP`	`1.0`	Maximum gap in seconds between adjacent timed words when `resolve-overlaps` builds word-run replacement segments. Must be a positive float.
`SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW`	`1.0`	Near-start window in seconds for ordering replacement word runs shortest-first. Must be a positive float.
`SERIATIM_BACKCHANNEL_MAX_DURATION`	`2.0`	Maximum duration in seconds for `backchannel` classification. Must be a positive float.
`SERIATIM_FILLER_MAX_DURATION`	`1.25`	Maximum duration in seconds for `filler` classification. Must be a positive float.
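
The precedence between --output-schema and SERIATIM_OUTPUT_SCHEMA can be illustrated with a short Go sketch. resolveSchema is a hypothetical helper, not a function from the seriatim codebase, and the error text is invented.

```go
package main

import (
	"fmt"
	"os"
)

const defaultSchema = "seriatim-intermediate"

var allowedSchemas = map[string]bool{
	"seriatim-minimal":      true,
	"seriatim-intermediate": true,
	"seriatim-full":         true,
}

// resolveSchema picks the output schema: the flag value first, then the
// SERIATIM_OUTPUT_SCHEMA environment variable, then the built-in default.
// getenv is injected so the lookup can be exercised without a real environment.
func resolveSchema(flagValue string, getenv func(string) string) (string, error) {
	schema := flagValue
	if schema == "" {
		schema = getenv("SERIATIM_OUTPUT_SCHEMA")
	}
	if schema == "" {
		schema = defaultSchema
	}
	if !allowedSchemas[schema] {
		return "", fmt.Errorf("unsupported output schema %q", schema)
	}
	return schema, nil
}

func main() {
	schema, err := resolveSchema("", os.Getenv)
	fmt.Println(schema, err)
}
```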

Input JSON Format

Each input file must be valid JSON with a top-level segments array. The current parser accepts the WhisperX segment subset needed for merging:

{
  "segments": [
    {
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "words": [
        {"word": "Hello", "start": 1.25, "end": 1.55, "score": 0.98},
        {"word": "there.", "start": 1.7, "end": 2.0}
      ]
    }
  ]
}

Required segment fields:

start: number, must be >= 0.
end: number, must be >= start.
text: string.

Optional word fields:

words: array of word timing objects.
words[].word: string.
words[].start: optional number, must be >= 0 when present.
words[].end: optional number, must be >= start when present with start.
words[].score: optional number.
words[].speaker: optional raw speaker label string.

Word-level timing is preserved internally for overlap resolution. If a word is missing start or end, seriatim keeps the word text, emits a warning in the optional report, and does not use that word as a timing anchor. Word timing is not emitted in the final JSON artifact.
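
As a rough guide to the field rules above, the following Go sketch decodes an input file and applies the same checks. The type names, the load function, and the warning strings are assumptions for illustration; seriatim's own parser may be structured differently. Pointer fields are used so that a missing value can be told apart from zero.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Word struct {
	Word  string   `json:"word"`
	Start *float64 `json:"start"` // optional
	End   *float64 `json:"end"`   // optional
}

type Segment struct {
	Start *float64 `json:"start"`
	End   *float64 `json:"end"`
	Text  *string  `json:"text"`
	Words []Word   `json:"words"`
}

type Transcript struct {
	Segments []Segment `json:"segments"`
}

// load decodes a transcript, fails on rule violations, and returns a warning
// for each word that lacks start or end (kept as text, not used for timing).
func load(data []byte) (Transcript, []string, error) {
	var t Transcript
	if err := json.Unmarshal(data, &t); err != nil {
		return t, nil, err
	}
	if t.Segments == nil {
		return t, nil, fmt.Errorf("missing top-level segments array")
	}
	var warnings []string
	for i, s := range t.Segments {
		if s.Start == nil || s.End == nil || s.Text == nil {
			return t, nil, fmt.Errorf("segment %d: start, end, and text are required", i)
		}
		if *s.Start < 0 || *s.End < *s.Start {
			return t, nil, fmt.Errorf("segment %d: invalid start/end", i)
		}
		for j, w := range s.Words {
			switch {
			case w.Start != nil && *w.Start < 0:
				return t, nil, fmt.Errorf("segment %d word %d: negative start", i, j)
			case w.Start != nil && w.End != nil && *w.End < *w.Start:
				return t, nil, fmt.Errorf("segment %d word %d: end before start", i, j)
			case w.Start == nil || w.End == nil:
				warnings = append(warnings, fmt.Sprintf("segment %d word %d: untimed", i, j))
			}
		}
	}
	return t, warnings, nil
}

func main() {
	data := []byte(`{"segments":[{"start":1.25,"end":3.5,"text":"Hello there.",
		"words":[{"word":"Hello","start":1.25,"end":1.55},{"word":"there."}]}]}`)
	_, warnings, err := load(data)
	fmt.Println(warnings, err)
}
```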

Speaker Map Format

speakers.yml maps input files to canonical speaker names using ordered substring rules. The file is optional: if --speakers is omitted, seriatim uses each input file basename as the segment speaker label.

Example speakers.yml:

match:
  - speaker: "Eric Rakestraw"
    match:
      - "Eric_Rakestraw"
      - "Eric"

  - speaker: "Mike Brown"
    match:
      - "Mike_Brown"
      - "mb"

For each --input-file, seriatim takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.

For example, this input:

samples/raw/2026-04-19-Eric_Rakestraw.json

matches this rule because the basename contains Eric_Rakestraw:

- speaker: "Eric Rakestraw"
  match:
    - "Eric_Rakestraw"

Important details:

Matching is against the input file basename, not the full path.
Matching is case-insensitive.
Rules are evaluated from first to last.
Each rule must have a non-empty speaker.
Each rule must have at least one non-empty match string.
Duplicate speaker names are invalid.
Every input file must match at least one rule or the command fails.
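
The matching rules above amount to a first-match, case-insensitive substring search over the basename. A minimal Go sketch, assuming the basename includes the file extension and using hypothetical Rule and resolveSpeaker names:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Rule is a hypothetical in-memory form of one speakers.yml entry.
type Rule struct {
	Speaker string
	Match   []string
}

// resolveSpeaker returns the speaker of the first rule with a substring that
// occurs in the input file's basename, ignoring case. Later rules are not tried.
func resolveSpeaker(rules []Rule, inputPath string) (string, error) {
	base := strings.ToLower(filepath.Base(inputPath))
	for _, r := range rules {
		for _, m := range r.Match {
			if strings.Contains(base, strings.ToLower(m)) {
				return r.Speaker, nil
			}
		}
	}
	return "", fmt.Errorf("no speaker rule matches %q", inputPath)
}

func main() {
	rules := []Rule{
		{Speaker: "Eric Rakestraw", Match: []string{"Eric_Rakestraw", "Eric"}},
		{Speaker: "Mike Brown", Match: []string{"Mike_Brown", "mb"}},
	}
	fmt.Println(resolveSpeaker(rules, "samples/raw/2026-04-19-Eric_Rakestraw.json"))
	fmt.Println(resolveSpeaker(rules, "notes.json"))
}
```

Because short substrings such as "Eric" or "mb" can match unrelated file names, it helps to list the most specific rules first.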

Unsupported old format:

inputs:
  eric.json:
    speaker: "Eric Rakestraw"

The old inputs: direct mapping format is no longer supported.

Output JSON Format

--output-modules json controls the writer. --output-schema controls the JSON contract that writer serializes.

The named schemas are stable public contracts. If a consumer depends on a specific shape, it should request that schema explicitly at runtime. The runtime default selection may change in a future release.

The seriatim-intermediate schema is the current default selection when neither --output-schema nor SERIATIM_OUTPUT_SCHEMA is set. It stays close to the minimal schema, but adds optional categories on each segment:

{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "output_schema": "seriatim-intermediate"
  },
  "segments": [
    {
      "id": 1,
      "start": 1.25,
      "end": 3.5,
      "speaker": "Eric Rakestraw",
      "text": "Hello there.",
      "categories": ["backchannel"]
    }
  ]
}

The seriatim-full schema uses the full seriatim envelope:

{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "resolve-danglers", "coalesce", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
    {
      "id": 1,
      "source": "eric.json",
      "source_segment_index": 0,
      "speaker": "Eric Rakestraw",
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there.",
      "overlap_group_id": 1
    },
    {
      "id": 2,
      "source": "eric.json",
      "source_ref": "word-run:1:1:1",
      "derived_from": ["eric.json#0"],
      "speaker": "Eric Rakestraw",
      "start": 2.0,
      "end": 2.5,
      "text": "Resolved word run",
      "categories": ["backchannel"]
    }
  ],
  "overlap_groups": [
    {
      "id": 1,
      "start": 1.25,
      "end": 4.0,
      "segments": ["eric.json#0", "mike.json#0"],
      "speakers": ["Eric Rakestraw", "Mike Brown"],
      "class": "unknown",
      "resolution": "unresolved"
    }
  ]
}

The seriatim-minimal schema emits minimal metadata and compact ordered segments:

{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "output_schema": "seriatim-minimal"
  },
  "segments": [
    {
      "id": 1,
      "start": 1.25,
      "end": 3.5,
      "speaker": "Eric Rakestraw",
      "text": "Hello there."
    }
  ]
}

Minimal output intentionally omits categories, overlap groups, source/provenance fields, and pipeline configuration metadata.

Intermediate output intentionally omits overlap groups and source/provenance fields, but keeps optional categories and minimal metadata.

Segments are sorted deterministically by:

(start, end, source, source_segment_index/source_ref, speaker)

Final segment IDs are assigned after sorting and start at 1.
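
The sort key can be expressed as a chained comparison. The Go sketch below is a simplified illustration: it compares source_segment_index only and does not model source_ref, and the Segment type and function names are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

type Segment struct {
	ID                 int
	Start, End         float64
	Source             string
	SourceSegmentIndex int
	Speaker            string
}

// less orders by (start, end, source, source_segment_index, speaker).
func less(a, b Segment) bool {
	switch {
	case a.Start != b.Start:
		return a.Start < b.Start
	case a.End != b.End:
		return a.End < b.End
	case a.Source != b.Source:
		return a.Source < b.Source
	case a.SourceSegmentIndex != b.SourceSegmentIndex:
		return a.SourceSegmentIndex < b.SourceSegmentIndex
	default:
		return a.Speaker < b.Speaker
	}
}

// sortAndNumber sorts segments in place and assigns IDs starting at 1.
func sortAndNumber(segs []Segment) {
	sort.SliceStable(segs, func(i, j int) bool { return less(segs[i], segs[j]) })
	for i := range segs {
		segs[i].ID = i + 1
	}
}

func main() {
	segs := []Segment{
		{Start: 2, End: 4, Source: "mike.json", Speaker: "Mike Brown"},
		{Start: 1, End: 3, Source: "eric.json", Speaker: "Eric Rakestraw"},
		{Start: 1, End: 2, Source: "mike.json", SourceSegmentIndex: 1, Speaker: "Mike Brown"},
	}
	sortAndNumber(segs)
	for _, s := range segs {
		fmt.Println(s.ID, s.Start, s.End, s.Source)
	}
}
```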

The public Go output contract is available from:

import "gitea.maximumdirect.net/eric/seriatim/schema"

The same package embeds machine-readable JSON Schemas in schema/full-output.schema.json, schema/intermediate-output.schema.json, and schema/minimal-output.schema.json. The default validate-output postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at 1.

Overlap Detection

The default postprocessing pipeline detects overlapping segment groups.

Overlap behavior:

A strict timing overlap is required: next.start < current_group_end.
Segments that only touch at a boundary are not grouped.
Groups require at least two distinct speakers.
Transitive overlaps are grouped together.
Segments in detected groups receive overlap_group_id.
overlap_groups[].segments contains stable references in source#source_segment_index format.
class is currently unknown.
resolution is unresolved until resolve-overlaps replaces the group.
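
The grouping rules above can be sketched as a single sweep over start-sorted segments. This Go example is an illustration under assumptions: detectGroups is a hypothetical name, groups are returned as slices of segment indices, and the two-speaker requirement is applied after timing clusters are formed.

```go
package main

import "fmt"

type Segment struct {
	Speaker    string
	Start, End float64
}

// detectGroups expects segs sorted by start time. A segment joins the current
// cluster only if it starts strictly before the cluster's running end, so
// segments that merely touch are not grouped and chains group transitively.
// Clusters with fewer than two distinct speakers are discarded.
func detectGroups(segs []Segment) [][]int {
	var groups [][]int
	flush := func(members []int) {
		speakers := map[string]bool{}
		for _, i := range members {
			speakers[segs[i].Speaker] = true
		}
		if len(speakers) >= 2 {
			groups = append(groups, members)
		}
	}
	var cur []int
	groupEnd := 0.0
	for i, s := range segs {
		if len(cur) > 0 && s.Start < groupEnd {
			cur = append(cur, i)
			if s.End > groupEnd {
				groupEnd = s.End
			}
			continue
		}
		flush(cur)
		cur = []int{i}
		groupEnd = s.End
	}
	flush(cur)
	return groups
}

func main() {
	segs := []Segment{
		{"Eric Rakestraw", 0, 2},
		{"Mike Brown", 1, 3},
		{"Eric Rakestraw", 3, 4}, // touches the previous group at 3.0; not grouped
		{"Mike Brown", 5, 6},
	}
	fmt.Println(detectGroups(segs)) // [[0 1]]
}
```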

Overlap Resolution

The default postprocessing pipeline runs detect-overlaps, then resolve-overlaps, then backchannel, then filler, then resolve-danglers, then coalesce, then a second detect-overlaps pass.

For each detected overlap group, resolve-overlaps uses preserved WhisperX word timing to build smaller word-run replacement segments:

The resolution window expands the detected overlap group by --coalesce-gap seconds on both sides.
Nearby same-speaker context segments are included when they intersect the expanded window and their start or end is within --coalesce-gap of the original overlap boundary.
Once a segment is selected for replacement, all timed words from that segment participate in word-run construction; the window controls segment selection, not per-word clipping.
Context segments that are part of another detected overlap group are not pulled into the current group.
Untimed words are included in replacement text in original word order when nearby timed words create a replacement run.
Untimed words do not affect replacement segment start/end times or word-run gap splitting.
Words for the same speaker are merged into one run when the gap between adjacent words is no greater than SERIATIM_OVERLAP_WORD_RUN_GAP.
The default word-run gap is 1.0 seconds.
Set SERIATIM_OVERLAP_WORD_RUN_GAP to a positive number of seconds to override the default.
Near-start replacement word runs are reordered so shorter segments come first when adjacent starts are within SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW.
The default word-run reorder window is 1.0 seconds.
Set SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW to a positive number of seconds to override the default.
Replacement segment text is built by joining word text with single spaces.
Replacement segments include source_ref and derived_from.
Replacement segments omit source_segment_index because they are derived from one or more original segments.
Resolved overlap groups are removed before the second detection pass.
Replacement segments are left without overlap_group_id until the second detection pass annotates any remaining overlap.
If a speaker has no usable word timing in a group, that speaker's original segment is kept.
If no speakers in a group have usable word timing, the original group and annotations remain unchanged.
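
To make the word-run rules more concrete, here is a small Go sketch that splits one speaker's words into runs by gap. It covers only gap splitting, untimed-word handling, and text joining; window selection and reordering are omitted. The types and the buildRuns name are hypothetical, and attaching an untimed word to the run that is open when it appears is an assumption, since the text above does not say which neighboring run receives it.

```go
package main

import (
	"fmt"
	"strings"
)

type Word struct {
	Text       string
	Start, End float64
	Timed      bool // false when start or end was missing in the input
}

type Run struct {
	Start, End float64
	Text       string
}

// buildRuns groups one speaker's words into runs. A new run starts when the
// gap from the previous timed word exceeds maxGap. Untimed words contribute
// text only; they never move run boundaries or trigger a split.
func buildRuns(words []Word, maxGap float64) []Run {
	var runs []Run
	var texts []string
	var start, end float64
	open := false // true once the current run contains a timed word
	flush := func() {
		if open {
			runs = append(runs, Run{Start: start, End: end, Text: strings.Join(texts, " ")})
		}
		texts, open = nil, false
	}
	for _, w := range words {
		if !w.Timed {
			texts = append(texts, w.Text)
			continue
		}
		if open && w.Start-end > maxGap {
			flush()
		}
		if !open {
			start, end, open = w.Start, w.End, true
		} else if w.End > end {
			end = w.End
		}
		texts = append(texts, w.Text)
	}
	flush()
	return runs
}

func main() {
	words := []Word{
		{"so", 0.0, 0.2, true},
		{"uh", 0, 0, false},
		{"yeah", 0.4, 0.6, true},
		{"right", 2.0, 2.3, true},
	}
	for _, r := range buildRuns(words, 1.0) {
		fmt.Printf("%.1f-%.1f %q\n", r.Start, r.End, r.Text)
	}
}
```

With no timed words at all, the sketch returns no runs, which corresponds to the case where the speaker's original segment is kept.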

Backchannels

The default pipeline runs backchannel before coalesce. It tags short acknowledgement segments with:

"categories": ["backchannel"]

Backchannel matching is case-insensitive, ignores punctuation for matching and word-count purposes, and trims surrounding whitespace. A segment is tagged only when it matches an acknowledgement phrase, contains no more than three whitespace-delimited words, and has a duration no greater than SERIATIM_BACKCHANNEL_MAX_DURATION seconds. The default maximum duration is 2.0 seconds.

Fillers

The default pipeline runs filler after backchannel and before coalesce. It tags short filler utterances with:

"categories": ["filler"]

Filler matching is case-insensitive, ignores punctuation for matching and word-count purposes, and trims surrounding whitespace. A segment is tagged only when it consists entirely of filler tokens such as um, uh, er, erm, ah, eh, hmm, mm, or repeated combinations of those tokens. Matching segments must contain no more than three whitespace-delimited words and have a duration no greater than SERIATIM_FILLER_MAX_DURATION seconds. The default maximum duration is 1.25 seconds.
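
A filler check along these lines could be sketched in Go as below. isFiller is a hypothetical function, and the token set contains only the tokens listed above; the real module may recognize more.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// fillerTokens holds only the tokens named in this README.
var fillerTokens = map[string]bool{
	"um": true, "uh": true, "er": true, "erm": true,
	"ah": true, "eh": true, "hmm": true, "mm": true,
}

// isFiller reports whether a segment is at most three words long, made up
// solely of filler tokens, and no longer than maxDuration seconds.
func isFiller(text string, start, end, maxDuration float64) bool {
	if end-start > maxDuration {
		return false
	}
	// Lowercase and strip punctuation before counting and matching words.
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) {
			return -1
		}
		return unicode.ToLower(r)
	}, text)
	words := strings.Fields(cleaned)
	if len(words) == 0 || len(words) > 3 {
		return false
	}
	for _, w := range words {
		if !fillerTokens[w] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isFiller("Um, uh.", 10.0, 11.0, 1.25))  // true
	fmt.Println(isFiller("Um, okay", 10.0, 11.0, 1.25)) // false
}
```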

Dangler Resolution

The default pipeline runs resolve-danglers before coalesce and before the second overlap detection pass. It repairs short derived fragments when they share provenance with a nearby segment:

Dangling-end fragments have no more than two words and end in punctuation.
Dangling-start fragments have no more than two words.
Matching uses same-speaker segments with any shared derived_from value.
Merged segments use source_ref values such as resolve-danglers:1, keep the target segment's transcript position, and union derived_from.

Coalescing

The default pipeline runs coalesce after resolve-danglers and before the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when next.start - current.end <= --coalesce-gap.

Coalesced segments use source_ref values such as coalesce:1, include derived_from, and omit source_segment_index.

Different-speaker backchannel and filler segments do not block coalescing of surrounding same-speaker segments. Same-speaker backchannel and filler segments are merged normally when they are within --coalesce-gap. When same-speaker segments are coalesced, any backchannel or filler category from the merged inputs is dropped from the coalesced segment.
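
The coalescing behavior can be approximated with the Go sketch below. It is an interpretation, not the actual module: mergeTarget and coalesce are hypothetical names, merged text is joined with a single space, and source_ref and derived_from bookkeeping is omitted.

```go
package main

import "fmt"

type Segment struct {
	Speaker    string
	Start, End float64
	Text       string
	Categories []string
}

func isInterjection(s Segment) bool {
	for _, c := range s.Categories {
		if c == "backchannel" || c == "filler" {
			return true
		}
	}
	return false
}

// mergeTarget walks backwards over out, skipping other speakers' backchannel
// and filler segments, and returns the index of the nearest same-speaker
// segment within gap seconds, or -1 when merging is blocked.
func mergeTarget(out []Segment, s Segment, gap float64) int {
	for j := len(out) - 1; j >= 0; j-- {
		if out[j].Speaker == s.Speaker {
			if s.Start-out[j].End <= gap {
				return j
			}
			return -1
		}
		if !isInterjection(out[j]) {
			return -1
		}
	}
	return -1
}

// coalesce merges same-speaker segments in transcript order and drops
// backchannel/filler categories from merged results.
func coalesce(segs []Segment, gap float64) []Segment {
	var out []Segment
	for _, s := range segs {
		j := mergeTarget(out, s, gap)
		if j < 0 {
			out = append(out, s)
			continue
		}
		out[j].Text += " " + s.Text
		if s.End > out[j].End {
			out[j].End = s.End
		}
		out[j].Categories = nil
	}
	return out
}

func main() {
	segs := []Segment{
		{"Eric Rakestraw", 0, 2, "So the plan", nil},
		{"Mike Brown", 2.1, 2.4, "yeah", []string{"backchannel"}},
		{"Eric Rakestraw", 2.5, 4, "is simple.", nil},
		{"Mike Brown", 8, 9, "Agreed.", nil},
	}
	for _, s := range coalesce(segs, 3.0) {
		fmt.Printf("%s %.1f-%.1f %q %v\n", s.Speaker, s.Start, s.End, s.Text, s.Categories)
	}
}
```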

Autocorrect

Autocorrect is included in the default postprocessing pipeline. If --autocorrect is omitted, the module leaves transcript text unchanged and records a skip event in the optional report.

Enable corrections by passing --autocorrect:

go run ./cmd/seriatim merge \
  --input-file input.json \
  --autocorrect autocorrect.yml \
  --output-file merged.json

autocorrect.yml format:

autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"

  - target: "Mike Brown"
    match:
      - "Mike Pat"

Matching behavior:

Matching is case-sensitive.
Matches apply only to whole tokens, not substrings inside larger words.
Punctuation and whitespace can surround a match.
Multi-word and hyphenated matches are supported.
Duplicate match strings are invalid, including duplicates across separate rules.
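
Whole-token, case-sensitive replacement can be sketched in Go without regular expressions. The replaceToken function below is hypothetical, and it assumes a token boundary is any position not adjacent to a letter or digit; the actual module may define boundaries differently, for example around hyphens or apostrophes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

func isWordRune(r rune) bool { return unicode.IsLetter(r) || unicode.IsDigit(r) }

// atBoundary reports whether text[start:end] is not glued to a letter or digit.
func atBoundary(text string, start, end int) bool {
	if start > 0 {
		if r, _ := utf8.DecodeLastRuneInString(text[:start]); isWordRune(r) {
			return false
		}
	}
	if end < len(text) {
		if r, _ := utf8.DecodeRuneInString(text[end:]); isWordRune(r) {
			return false
		}
	}
	return true
}

// replaceToken replaces case-sensitive, whole-token occurrences of match.
func replaceToken(text, match, target string) string {
	if match == "" {
		return text
	}
	var b strings.Builder
	pos := 0
	for {
		i := strings.Index(text[pos:], match)
		if i < 0 {
			break
		}
		start := pos + i
		end := start + len(match)
		if atBoundary(text, start, end) {
			b.WriteString(text[pos:start])
			b.WriteString(target)
			pos = end
			continue
		}
		// Not a whole token: copy one rune and keep searching after it.
		_, width := utf8.DecodeRuneInString(text[start:])
		b.WriteString(text[pos : start+width])
		pos = start + width
	}
	b.WriteString(text[pos:])
	return b.String()
}

func main() {
	fmt.Println(replaceToken("Frank met Frankfurt, Frank.", "Frank", "Hrank"))
	// Hrank met Frankfurt, Hrank.
}
```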

Current Limitations

Only JSON input is supported.
Overlap resolution depends on WhisperX word timing; groups without usable word timing remain unresolved.
Alternate output formats are not implemented yet.

Release Builds

Local builds record version metadata as dev. Release builds should inject the release version with ldflags:

go build -ldflags "-X gitea.maximumdirect.net/eric/seriatim/internal/buildinfo.Version=v1.0.0" ./cmd/seriatim