Implemented a new internal/danglers package with deterministic two-pass dangling-end then dangling-start resolution

This commit is contained in:
2026-04-28 15:38:16 -05:00
parent 47b6727973
commit f1ce35dfc3
8 changed files with 602 additions and 8 deletions

View File

@@ -51,7 +51,7 @@ Global flags:
| `--output-modules` | No | `json` | Comma-separated output modules. |
| `--output-schema` | No | `seriatim` | JSON output contract. Allowed values are `seriatim` and `minimal`. |
| `--preprocessing-modules` | No | `validate-raw,normalize-speakers,trim-text` | Comma-separated preprocessing modules, evaluated in order. |
| `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
| `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,coalesce,resolve-danglers,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
| `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |
Environment variables:
@@ -169,7 +169,7 @@ The default `seriatim` schema uses the full seriatim envelope:
"input_reader": "json-files",
"input_files": ["eric.json", "mike.json"],
"preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "coalesce", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "coalesce", "resolve-danglers", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
"output_modules": ["json"]
},
"segments": [
@@ -265,7 +265,7 @@ Overlap behavior:
## Overlap Resolution
The default postprocessing pipeline runs `detect-overlaps`, then `resolve-overlaps`, then `backchannel`, then `filler`, then `coalesce`, then a second `detect-overlaps` pass.
The default postprocessing pipeline runs `detect-overlaps`, then `resolve-overlaps`, then `backchannel`, then `filler`, then `coalesce`, then `resolve-danglers`, then a second `detect-overlaps` pass.
For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word timing to build smaller word-run replacement segments:
@@ -311,12 +311,21 @@ Filler matching is case-insensitive, ignores punctuation for matching and word-c
## Coalescing
The default pipeline runs `coalesce` before the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.
The default pipeline runs `coalesce` before `resolve-danglers` and the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.
Coalesced segments use `source_ref` values such as `coalesce:1`, include `derived_from`, and omit `source_segment_index`.
Different-speaker backchannel and filler segments do not block coalescing of surrounding same-speaker segments. Same-speaker backchannel and filler segments are merged normally when they are within `--coalesce-gap`. When same-speaker segments are coalesced, any `backchannel` or `filler` category from the merged inputs is dropped from the coalesced segment.
## Dangler Resolution
The default pipeline runs `resolve-danglers` after `coalesce` and before the second overlap detection pass. It repairs short derived fragments when they share provenance with a nearby segment:
- Dangling-end fragments have no more than two words and end in punctuation.
- Dangling-start fragments have no more than two words.
- Matching uses any shared `derived_from` value.
- Merged segments use `source_ref` values such as `resolve-danglers:1`, keep the target segment's transcript position, and union `derived_from`.
## Autocorrect
Autocorrect is included in the default postprocessing pipeline. If `--autocorrect` is omitted, the module leaves transcript text unchanged and records a skip event in the optional report.