Implemented an autocorrect module at the postprocessing stage

2026-04-26 19:33:23 -05:00
parent 99d0c425d6
commit 3928e0c4a7
7 changed files with 482 additions and 6 deletions


@@ -46,7 +46,7 @@ Optional flags:
- `--output-modules`: comma-separated output modules. Default: `json`.
- `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,assign-ids,validate-output`.
- `--autocorrect`: autocorrect rules file. Reserved for the `autocorrect` module; not part of the default pipeline.
- `--autocorrect`: autocorrect rules file. Required when the postprocessing `autocorrect` module is enabled.
## Input JSON Format
@@ -163,9 +163,46 @@ Segments are sorted deterministically by:
Final segment IDs are assigned after sorting and start at `1`.
## Autocorrect
Autocorrect is an opt-in postprocessing module. It is not part of the default pipeline.
Enable it by adding `autocorrect` to `--postprocessing-modules` and passing `--autocorrect`:
```sh
go run ./cmd/seriatim merge \
  --input-file input.json \
  --speakers speakers.yml \
  --autocorrect autocorrect.yml \
  --postprocessing-modules detect-overlaps,resolve-overlaps,autocorrect,assign-ids,validate-output \
  --output-file merged.json
```
`autocorrect.yml` format:
```yaml
autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"
  - target: "Mike Brown"
    match:
      - "Mike Pat"
```
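As a rough illustration of how the rules above might look once parsed, the sketch below flattens them into a `match -> target` lookup and rejects a match string that appears more than once, including across separate rules. The `Rule` type and `buildLookup` function are hypothetical names chosen for this example, not identifiers from the project, and YAML parsing is left out.

```go
package main

import "fmt"

// Rule mirrors one entry under the `autocorrect` key.
// Hypothetical type for illustration; not taken from the project source.
type Rule struct {
	Target string
	Match  []string
}

// buildLookup flattens rules into a match -> target map.
// It returns an error if any match string appears more than once,
// whether inside one rule or across separate rules.
func buildLookup(rules []Rule) (map[string]string, error) {
	lookup := make(map[string]string)
	for _, r := range rules {
		for _, m := range r.Match {
			if _, seen := lookup[m]; seen {
				return nil, fmt.Errorf("duplicate match string %q", m)
			}
			lookup[m] = r.Target
		}
	}
	return lookup, nil
}

func main() {
	rules := []Rule{
		{Target: "Hrank", Match: []string{"hrank", "Frank"}},
		{Target: "Mike Brown", Match: []string{"Mike Pat"}},
	}
	lookup, err := buildLookup(rules)
	if err != nil {
		fmt.Println("invalid rules:", err)
		return
	}
	fmt.Println(lookup["Frank"]) // prints "Hrank"
}
```

Keying the map on the match string makes the duplicate check a by-product of building the lookup, so no separate validation pass is needed in this sketch.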
Matching behavior:
- Matching is case-sensitive.
- Matches apply only to whole tokens, not substrings inside larger words.
- Punctuation and whitespace can surround a match.
- Multi-word and hyphenated matches are supported.
- Duplicate match strings are invalid, including duplicates across separate rules.
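The whole-token rule above can be sketched in Go as follows. This is a minimal sketch, not the module's actual code: `replaceWholeToken` and `isWordRune` are hypothetical names, and the definition of a token boundary (start or end of text, or any rune that is not a letter or digit) is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isWordRune reports whether r counts as part of a token.
// Assumption for this sketch: only letters and digits do.
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r)
}

// replaceWholeToken replaces each case-sensitive occurrence of match in
// text with target, but only when the occurrence is not directly adjacent
// to a letter or digit. Multi-word and hyphenated matches work because the
// match is compared as a plain substring.
func replaceWholeToken(text, match, target string) string {
	if match == "" {
		return text
	}
	var b strings.Builder
	i := 0
	for i < len(text) {
		j := strings.Index(text[i:], match)
		if j < 0 {
			break
		}
		start := i + j
		end := start + len(match)
		prev, _ := utf8.DecodeLastRuneInString(text[:start])
		next, _ := utf8.DecodeRuneInString(text[end:])
		leftOK := start == 0 || !isWordRune(prev)
		rightOK := end == len(text) || !isWordRune(next)
		if leftOK && rightOK {
			b.WriteString(text[i:start])
			b.WriteString(target)
			i = end
			continue
		}
		// Not a whole token: copy through the first rune of the
		// candidate and keep searching after it.
		_, size := utf8.DecodeRuneInString(text[start:])
		b.WriteString(text[i : start+size])
		i = start + size
	}
	b.WriteString(text[i:])
	return b.String()
}

func main() {
	s := "Frank, Franklin and (Mike Pat) spoke."
	s = replaceWholeToken(s, "Frank", "Hrank")
	s = replaceWholeToken(s, "Mike Pat", "Mike Brown")
	fmt.Println(s) // prints "Hrank, Franklin and (Mike Brown) spoke."
}
```

Under this boundary definition a hyphen also counts as a boundary, so a match `Ann` would fire inside `Mary-Ann`; whether the real module behaves that way is not stated here.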
## Current Limitations
- Only JSON input is supported.
- Word-level timing data is not preserved yet.
- Overlap detection and overlap resolution are currently no-op modules.
- Autocorrect, coalescing, and alternate output formats are not implemented yet.
- Coalescing and alternate output formats are not implemented yet.