Update the default postprocessing pipeline to run detect-overlaps twice
README.md | 10 +++++-----
@@ -44,7 +44,7 @@ Optional flags:
 - `--input-reader`: input reader module. Default: `json-files`.
 - `--output-modules`: comma-separated output modules. Default: `json`.
 - `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
-- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,autocorrect,assign-ids,validate-output`.
+- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,detect-overlaps,autocorrect,assign-ids,validate-output`.
 
 ## Input JSON Format
 
@@ -150,7 +150,7 @@ The merged output uses the current seriatim envelope:
 "input_reader": "json-files",
 "input_files": ["eric.json", "mike.json"],
 "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
-"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "autocorrect", "assign-ids", "validate-output"],
+"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
 "output_modules": ["json"]
 },
 "segments": [
@@ -214,7 +214,7 @@ Overlap behavior:
 
 ## Overlap Resolution
 
-The default postprocessing pipeline runs `resolve-overlaps` after `detect-overlaps`.
+The default postprocessing pipeline runs `detect-overlaps`, then `resolve-overlaps`, then a second `detect-overlaps` pass.
 
 For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word timing to build smaller word-run replacement segments:
 
@@ -227,8 +227,8 @@ For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word
 - Replacement segment text is built by joining word text with single spaces.
 - Replacement segments include `source_ref` and `derived_from`.
 - Replacement segments omit `source_segment_index` because they are derived from one or more original segments.
-- Resolved overlap groups are removed from `overlap_groups`.
-- Replacement segments are left without `overlap_group_id`; future passes can detect any remaining overlap.
+- Resolved overlap groups are removed before the second detection pass.
+- Replacement segments are left without `overlap_group_id` until the second detection pass annotates any remaining overlap.
 - If a speaker has no usable word timing in a group, that speaker's original segment is kept.
 - If no speakers in a group have usable word timing, the original group and annotations remain unchanged.
 
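The detect, resolve, detect ordering described in this change can be illustrated with a rough Python sketch. This is not the project's code: the `Segment` class, the `(text, start, end)` word tuples, both module stand-ins, and `run_default_postprocessing` are hypothetical names chosen for illustration, and the grouping rule (any time-overlapping segments form one group) is an assumption. The sketch only shows why a second detection pass is useful: replacement segments start without `overlap_group_id`, and the second pass annotates whatever overlap survives resolution.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (text, start, end); hypothetical shape


@dataclass
class Segment:
    speaker: str
    start: float
    end: float
    text: str
    words: List[Word] = field(default_factory=list)
    overlap_group_id: Optional[int] = None


def detect_overlaps(segments: List[Segment]) -> List[List[Segment]]:
    """Stand-in for detect-overlaps: clear old ids, then group overlapping segments."""
    for seg in segments:
        seg.overlap_group_id = None
    groups, cluster, cluster_end = [], [], float("-inf")
    for seg in sorted(segments, key=lambda s: s.start):
        if cluster and seg.start >= cluster_end:
            # The current cluster is finished; only clusters of 2+ are overlap groups.
            if len(cluster) > 1:
                groups.append(cluster)
            cluster, cluster_end = [], float("-inf")
        cluster.append(seg)
        cluster_end = max(cluster_end, seg.end)
    if len(cluster) > 1:
        groups.append(cluster)
    for group_id, group in enumerate(groups, start=1):
        for seg in group:
            seg.overlap_group_id = group_id
    return groups


def resolve_overlaps(segments: List[Segment], groups: List[List[Segment]]) -> List[Segment]:
    """Stand-in for resolve-overlaps: split each group into per-speaker word runs."""
    grouped = {id(seg) for group in groups for seg in group}
    result = [seg for seg in segments if id(seg) not in grouped]
    for group in groups:
        timed = [seg for seg in group if seg.words]
        if not timed:
            result.extend(group)  # no usable word timing: group stays unchanged
            continue
        # Speakers without word timing keep their original segment.
        result.extend(seg for seg in group if not seg.words)
        words = sorted(
            ((seg.speaker, w) for seg in timed for w in seg.words),
            key=lambda item: item[1][1],
        )
        # Consecutive words from the same speaker become one replacement segment.
        for speaker, run in groupby(words, key=lambda item: item[0]):
            run_words = [w for _, w in run]
            text = " ".join(w[0] for w in run_words)  # join with single spaces
            result.append(Segment(speaker, run_words[0][1], run_words[-1][2], text))
    return sorted(result, key=lambda s: s.start)


def run_default_postprocessing(segments: List[Segment]):
    groups = detect_overlaps(segments)             # first detection pass
    segments = resolve_overlaps(segments, groups)  # replacements carry no group id
    groups = detect_overlaps(segments)             # second pass annotates leftovers
    return segments, groups
```

In this sketch, two speakers whose word runs still overlap by a fraction of a second after resolution end up sharing a fresh group id from the second pass, while fully separated runs keep `overlap_group_id` unset.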