Implemented an autocorrect module at the postprocessing stage
41
README.md
@@ -46,7 +46,7 @@ Optional flags:
- `--output-modules`: comma-separated output modules. Default: `json`.
- `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,assign-ids,validate-output`.
-- `--autocorrect`: autocorrect rules file. Reserved for the `autocorrect` module; not part of the default pipeline.
+- `--autocorrect`: autocorrect rules file. Required when the postprocessing `autocorrect` module is enabled.

## Input JSON Format
@@ -163,9 +163,46 @@ Segments are sorted deterministically by:

Final segment IDs are assigned after sorting and start at `1`.

## Autocorrect

Autocorrect is an opt-in postprocessing module. It is not part of the default pipeline.

Enable it by adding `autocorrect` to `--postprocessing-modules` and passing `--autocorrect`:

```sh
go run ./cmd/seriatim merge \
  --input-file input.json \
  --speakers speakers.yml \
  --autocorrect autocorrect.yml \
  --postprocessing-modules detect-overlaps,resolve-overlaps,autocorrect,assign-ids,validate-output \
  --output-file merged.json
```

`autocorrect.yml` format:

```yaml
autocorrect:
  - target: "Hrank"
    match:
      - "hrank"
      - "Frank"

  - target: "Mike Brown"
    match:
      - "Mike Pat"
```
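
To make the shape of the rules file concrete, here is a minimal Go sketch of how such a file might be represented once decoded. It is an illustration only: the names `RulesFile`, `Rule`, and `buildLookup` are hypothetical and not taken from the seriatim source, the `yaml` struct tags assume a YAML library that reads such tags, and the decoding step itself is not shown. The duplicate check reflects the README's statement that duplicate match strings are invalid, including across separate rules.

```go
package main

import "fmt"

// Rule mirrors one list entry of the rules file (hypothetical type):
// every string in Match is meant to be rewritten to Target.
type Rule struct {
	Target string   `yaml:"target"`
	Match  []string `yaml:"match"`
}

// RulesFile mirrors the top-level "autocorrect" key (hypothetical type).
type RulesFile struct {
	Autocorrect []Rule `yaml:"autocorrect"`
}

// buildLookup flattens the rules into a match-string -> target map.
// It returns an error as soon as a match string appears twice,
// whether inside one rule or across separate rules.
func buildLookup(rules []Rule) (map[string]string, error) {
	lookup := make(map[string]string)
	for _, r := range rules {
		for _, m := range r.Match {
			if _, seen := lookup[m]; seen {
				return nil, fmt.Errorf("duplicate match string %q", m)
			}
			lookup[m] = r.Target
		}
	}
	return lookup, nil
}
```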

Matching behavior:

- Matching is case-sensitive.
- Matches apply only to whole tokens, not substrings inside larger words.
- Punctuation and whitespace can surround a match.
- Multi-word and hyphenated matches are supported.
- Duplicate match strings are invalid, including duplicates across separate rules.
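
The matching rules listed above can be sketched in Go as follows. This is one possible reading, not the module's actual implementation: `applyRules` and `isWordRune` are hypothetical names, a token boundary is assumed to be any rune that is neither a letter nor a digit (so a hyphen also counts as a boundary), and longer match strings are assumed to win over shorter ones at the same position. The function takes a plain map from match string to target, which is assumed to have been checked for duplicates already.

```go
package main

import (
	"sort"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isWordRune reports whether r is treated as part of a token.
// Assumption: letters and digits only; everything else is a boundary.
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r)
}

// applyRules rewrites whole-token, case-sensitive occurrences of each
// key in lookup with its target, scanning the text once left to right.
func applyRules(text string, lookup map[string]string) string {
	// Try longer match strings first so multi-word matches win.
	keys := make([]string, 0, len(lookup))
	for k := range lookup {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if len(keys[i]) != len(keys[j]) {
			return len(keys[i]) > len(keys[j])
		}
		return keys[i] < keys[j]
	})

	var out strings.Builder
	prevIsWord := false // whether the rune before position i is a word rune
	for i := 0; i < len(text); {
		matched := false
		if !prevIsWord {
			for _, k := range keys {
				if k == "" || !strings.HasPrefix(text[i:], k) {
					continue
				}
				end := i + len(k)
				if end < len(text) {
					// Reject matches that continue into a larger word.
					next, _ := utf8.DecodeRuneInString(text[end:])
					if isWordRune(next) {
						continue
					}
				}
				out.WriteString(lookup[k])
				last, _ := utf8.DecodeLastRuneInString(k)
				prevIsWord = isWordRune(last)
				i = end
				matched = true
				break
			}
		}
		if matched {
			continue
		}
		// No rule applies here: copy one rune through unchanged.
		r, size := utf8.DecodeRuneInString(text[i:])
		out.WriteString(text[i : i+size])
		prevIsWord = isWordRune(r)
		i += size
	}
	return out.String()
}
```

With the example rules from `autocorrect.yml`, `applyRules("Frank met Franklin.", lookup)` rewrites the standalone `Frank` to `Hrank` but leaves `Franklin` alone, and a lowercase `frank` is left untouched because matching is case-sensitive. Scanning the original text once, instead of replacing rule by rule, means an inserted target is never matched again by a later rule.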

## Current Limitations

- Only JSON input is supported.
- Word-level timing data is not preserved yet.
- Overlap detection and overlap resolution are currently no-op modules.
-- Autocorrect, coalescing, and alternate output formats are not implemented yet.
+- Coalescing and alternate output formats are not implemented yet.