Added initial segment overlap resolution logic

2026-04-27 15:52:53 -05:00
parent e42a2326e8
commit 1b9f4bd922
16 changed files with 1357 additions and 59 deletions
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

 `seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.

-The current implementation supports the `merge` command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, assigns consecutive numeric `id` values, and writes a merged JSON artifact.
+The current implementation supports the `merge` command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact.

 ## Usage

@@ -56,7 +56,11 @@ Each input file must be valid JSON with a top-level `segments` array. The curren
    {
      "start": 1.25,
      "end": 3.5,
-      "text": "Hello there."
+      "text": "Hello there.",
+      "words": [
+        {"word": "Hello", "start": 1.25, "end": 1.55, "score": 0.98},
+        {"word": "there.", "start": 1.7, "end": 2.0}
+      ]
    }
  ]
 }
@@ -68,7 +72,16 @@ Required segment fields:
 - `end`: number, must be `>= start`.
 - `text`: string.

-Other WhisperX fields, including `words` and raw diarization speaker labels, are ignored for now.
+Optional word fields:
+
+- `words`: array of word timing objects.
+- `words[].word`: string.
+- `words[].start`: optional number, must be `>= 0` when present.
+- `words[].end`: optional number, must be `>= start` when present with `start`.
+- `words[].score`: optional number.
+- `words[].speaker`: optional raw speaker label string.
+
+Word-level timing is preserved internally for overlap resolution. If a word is missing `start` or `end`, seriatim keeps the word text, emits a warning in the optional report, and does not use that word as a timing anchor. Word timing is not emitted in the final JSON artifact.

 ## Speaker Map Format

@@ -150,6 +163,16 @@ The merged output uses the current seriatim envelope:
      "end": 3.5,
      "text": "Hello there.",
      "overlap_group_id": 1
+    },
+    {
+      "id": 2,
+      "source": "eric.json",
+      "source_ref": "word-run:1:1:1",
+      "derived_from": ["eric.json#0"],
+      "speaker": "Eric Rakestraw",
+      "start": 2.0,
+      "end": 2.5,
+      "text": "Resolved word run"
    }
  ],
  "overlap_groups": [
@@ -169,7 +192,7 @@ The merged output uses the current seriatim envelope:
 Segments are sorted deterministically by:

 ```text
-(start, end, source, source_segment_index, speaker)
+(start, end, source, source_segment_index/source_ref, speaker)
 ```

 Final segment IDs are assigned after sorting and start at `1`.
@@ -187,7 +210,27 @@ Overlap behavior:
 - Segments in detected groups receive `overlap_group_id`.
 - `overlap_groups[].segments` contains stable references in `source#source_segment_index` format.
 - `class` is currently `unknown`.
-- `resolution` is currently `unresolved`; overlap resolution is still a no-op.
+- `resolution` is `unresolved` until `resolve-overlaps` replaces the group.
+
+## Overlap Resolution
+
+The default postprocessing pipeline runs `resolve-overlaps` after `detect-overlaps`.
+
+For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word timing to build smaller word-run replacement segments:
+
+- Words are included when their interval intersects the overlap window: `word.end > group.start && word.start < group.end`.
+- Untimed words are included in replacement text in original word order when nearby timed words create a replacement run.
+- Untimed words do not affect replacement segment start/end times or word-run gap splitting.
+- Words for the same speaker are merged into one run when the gap between adjacent words is no greater than `SERIATIM_OVERLAP_WORD_RUN_GAP`.
+- The default word-run gap is `0.75` seconds.
+- Set `SERIATIM_OVERLAP_WORD_RUN_GAP` to a positive number of seconds to override the default.
+- Replacement segment text is built by joining word text with single spaces.
+- Replacement segments include `source_ref` and `derived_from`.
+- Replacement segments omit `source_segment_index` because they are derived from one or more original segments.
+- Resolved overlap groups are removed from `overlap_groups`.
+- Replacement segments are left without `overlap_group_id`; future passes can detect any remaining overlap.
+- If a speaker has no usable word timing in a group, that speaker's original segment is kept.
+- If no speakers in a group have usable word timing, the original group and annotations remain unchanged.

 ## Autocorrect

@@ -227,6 +270,5 @@ Matching behavior:
 ## Current Limitations

 - Only JSON input is supported.
-- Word-level timing data is not preserved yet.
-- Overlap resolution is currently a no-op module.
+- Overlap resolution depends on WhisperX word timing; groups without usable word timing remain unresolved.
 - Coalescing and alternate output formats are not implemented yet.
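
The word-run rules that this change adds to the README can be illustrated with a minimal Python sketch. It is not the seriatim implementation: the `Word` type and the `word_run_gap` and `build_word_runs` helpers are hypothetical names introduced here. Two behaviors are assumptions, because the README text does not specify them: an untimed word joins the run that is currently open (or the next run if none is open), and an invalid or non-positive `SERIATIM_OVERLAP_WORD_RUN_GAP` falls back to the default.

```python
from dataclasses import dataclass
from typing import List, Mapping, Optional, Tuple

DEFAULT_WORD_RUN_GAP = 0.75  # seconds, the default stated in the README


@dataclass
class Word:  # hypothetical stand-in for a WhisperX word entry
    text: str
    start: Optional[float] = None
    end: Optional[float] = None


def word_run_gap(env: Mapping[str, str]) -> float:
    """Read the gap override; falling back on bad input is an assumption."""
    raw = env.get("SERIATIM_OVERLAP_WORD_RUN_GAP")
    try:
        value = float(raw) if raw is not None else DEFAULT_WORD_RUN_GAP
    except ValueError:
        return DEFAULT_WORD_RUN_GAP
    return value if value > 0 else DEFAULT_WORD_RUN_GAP


def build_word_runs(
    words: List[Word],
    group_start: float,
    group_end: float,
    max_gap: float = DEFAULT_WORD_RUN_GAP,
) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) runs for one speaker inside one overlap group."""
    runs: List[List[Word]] = []
    current: Optional[List[Word]] = None  # run being built, if any
    leading: List[Word] = []              # untimed words waiting for a run
    last_end = 0.0

    for w in words:
        if w.start is None or w.end is None:
            # Untimed words carry text only (assumed attachment policy).
            (current if current is not None else leading).append(w)
            continue
        # Interval test from the README: word.end > group.start && word.start < group.end
        if not (w.end > group_start and w.start < group_end):
            if current is not None:
                runs.append(current)
                current = None
            leading = []
            continue
        if current is not None and w.start - last_end <= max_gap:
            current.append(w)
        else:
            if current is not None:
                runs.append(current)
            current = leading + [w]
        leading = []
        last_end = w.end

    if current is not None:
        runs.append(current)

    result = []
    for run in runs:
        timed = [w for w in run if w.start is not None and w.end is not None]
        start = min(w.start for w in timed)  # untimed words never affect timing
        end = max(w.end for w in timed)
        result.append((start, end, " ".join(w.text for w in run)))
    return result
```

With an overlap window of 0.9 to 3.5 seconds, the words `so` (0.0 to 0.3) and `later` (5.0 to 5.4) fall outside the window, an untimed `um` between `I` and `think` is kept in the text without changing the timing, and `yes` at 3.0 starts a second run because it begins 1.4 seconds after `think` ends.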