Simplify the CLI interface and update documentation accordingly
README.md (20 lines changed)
@@ -2,7 +2,7 @@
 
 `seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.
 
-The current implementation supports the `merge` command. It reads one or more input JSON files, maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, assigns consecutive numeric `id` values, and writes a merged JSON artifact.
+The current implementation supports the `merge` command. It reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, assigns consecutive numeric `id` values, and writes a merged JSON artifact.
 
 ## Usage
 
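Review note: the merge step described in this hunk (sort all segments by timestamp, then assign consecutive numeric `id` values) could look roughly like the Go sketch below. It is an illustration only, not seriatim's code: the `Segment` type and `mergeSegments` function are hypothetical names, and sorting by start time with a stable sort is an assumption about how ties are handled.

```go
package main

import (
	"fmt"
	"sort"
)

// Segment is a hypothetical, simplified transcript segment.
// The real seriatim types are not shown in this diff.
type Segment struct {
	ID      int
	Start   float64
	End     float64
	Speaker string
	Text    string
}

// mergeSegments concatenates the per-speaker inputs, sorts them by start
// time (stable, so equal timestamps keep their input order, which is an
// assumption), and assigns consecutive IDs starting at 1.
func mergeSegments(inputs ...[]Segment) []Segment {
	var merged []Segment
	for _, in := range inputs {
		merged = append(merged, in...)
	}
	sort.SliceStable(merged, func(i, j int) bool {
		return merged[i].Start < merged[j].Start
	})
	for i := range merged {
		merged[i].ID = i + 1
	}
	return merged
}

func main() {
	eric := []Segment{
		{Start: 0.0, End: 2.0, Speaker: "Eric Rakestraw", Text: "Hello."},
		{Start: 5.0, End: 6.5, Speaker: "Eric Rakestraw", Text: "Sounds good."},
	}
	mike := []Segment{
		{Start: 2.5, End: 4.5, Speaker: "Mike Brown", Text: "Hi, Eric."},
	}
	for _, s := range mergeSegments(eric, mike) {
		fmt.Printf("%d %.1f %s: %s\n", s.ID, s.Start, s.Speaker, s.Text)
	}
}
```

Assigning IDs only after the sort keeps them consecutive in chronological order regardless of how the inputs were interleaved.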
@@ -12,7 +12,6 @@ Run from source:
 go run ./cmd/seriatim merge \
   --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
   --input-file samples/raw/2026-04-19-Mike_Brown.json \
-  --speakers samples/speakers.yml \
   --output-file merged.json
 ```
 
@@ -22,7 +21,6 @@ Optional report output:
 go run ./cmd/seriatim merge \
   --input-file eric.json \
   --input-file mike.json \
-  --speakers speakers.yml \
   --output-file merged.json \
   --report-file report.json
 ```
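Review note: the invocations in the hunks above repeat `--input-file` once per speaker and no longer pass `--speakers`. A repeatable flag with an optional speaker map can be sketched with Go's standard `flag` package as shown below. This is an assumption-laden illustration: the diff does not show which flag library seriatim uses, and `stringList`, `mergeOptions`, and `parseMergeFlags` are hypothetical names.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"strings"
)

// stringList collects every occurrence of a repeatable flag.
type stringList []string

func (s *stringList) String() string     { return strings.Join(*s, ",") }
func (s *stringList) Set(v string) error { *s = append(*s, v); return nil }

// mergeOptions is a hypothetical container for the parsed flags.
type mergeOptions struct {
	InputFiles []string
	Speakers   string // empty when --speakers is omitted
	OutputFile string
}

// parseMergeFlags parses a subset of the merge flags. It requires at least
// one --input-file and an --output-file, and treats --speakers as optional.
func parseMergeFlags(args []string) (mergeOptions, error) {
	fs := flag.NewFlagSet("merge", flag.ContinueOnError)
	var inputs stringList
	fs.Var(&inputs, "input-file", "input transcript JSON file (repeatable)")
	speakers := fs.String("speakers", "", "optional speaker map YAML file")
	output := fs.String("output-file", "", "merged transcript JSON output path")
	if err := fs.Parse(args); err != nil {
		return mergeOptions{}, err
	}
	if len(inputs) == 0 || *output == "" {
		return mergeOptions{}, errors.New("--input-file and --output-file are required")
	}
	return mergeOptions{InputFiles: inputs, Speakers: *speakers, OutputFile: *output}, nil
}

func main() {
	opts, err := parseMergeFlags([]string{
		"--input-file", "eric.json",
		"--input-file", "mike.json",
		"--output-file", "merged.json",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(opts.InputFiles, opts.Speakers == "", opts.OutputFile)
}
```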
@@ -36,17 +34,17 @@ seriatim merge [flags]
 Required flags for the default pipeline:
 
 - `--input-file`: input transcript JSON file. Repeat once per speaker/input file.
-- `--speakers`: speaker map YAML file. Required because `normalize-speakers` is enabled by default.
 - `--output-file`: merged transcript JSON output path.
 
 Optional flags:
 
 - `--report-file`: write a JSON report with pipeline events.
+- `--speakers`: speaker map YAML file. When omitted, input file basenames are used as speaker labels.
+- `--autocorrect`: autocorrect rules file. When omitted, the default `autocorrect` module no-ops.
 - `--input-reader`: input reader module. Default: `json-files`.
 - `--output-modules`: comma-separated output modules. Default: `json`.
 - `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
-- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,assign-ids,validate-output`.
-- `--autocorrect`: autocorrect rules file. Required when the postprocessing `autocorrect` module is enabled.
+- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,autocorrect,assign-ids,validate-output`.
 
 ## Input JSON Format
 
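Review note: the new `--speakers` behavior in this hunk (use the input file basename as the speaker label when no speaker map is given) can be sketched as follows. The README elsewhere describes `speakers.yml` as ordered substring rules, so the sketch applies the first matching rule. Everything here is hypothetical: the `Rule` type, its `Contains` field, and `resolveSpeaker` are illustrative names, and falling back to the basename when rules exist but none match is an assumption, as is keeping the file extension in the label.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Rule is a hypothetical in-memory form of one speakers.yml entry:
// if the input file's basename contains Contains, the file maps to Speaker.
type Rule struct {
	Speaker  string
	Contains string
}

// resolveSpeaker returns the speaker of the first rule whose substring
// occurs in the input file's basename. With no rules (the --speakers flag
// omitted) or no match, it falls back to the basename itself.
func resolveSpeaker(inputPath string, rules []Rule) string {
	base := filepath.Base(inputPath)
	for _, r := range rules {
		if strings.Contains(base, r.Contains) {
			return r.Speaker
		}
	}
	return base
}

func main() {
	rules := []Rule{{Speaker: "Eric Rakestraw", Contains: "Eric"}}
	path := "samples/raw/2026-04-19-Eric_Rakestraw.json"

	fmt.Println(resolveSpeaker(path, rules)) // label from the speaker map
	fmt.Println(resolveSpeaker(path, nil))   // label from the basename
}
```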
@@ -76,6 +74,8 @@ Other WhisperX fields, including `words` and raw diarization speaker labels, are
 
 `speakers.yml` maps input files to canonical speaker names using ordered substring rules:
 
+This file is optional. If `--speakers` is omitted, `seriatim` uses each input file basename as the segment speaker label.
+
 ```yaml
 match:
 - speaker: "Eric Rakestraw"
@@ -137,7 +137,7 @@ The merged output uses the current seriatim envelope:
 "input_reader": "json-files",
 "input_files": ["eric.json", "mike.json"],
 "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
-"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "assign-ids", "validate-output"],
+"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "autocorrect", "assign-ids", "validate-output"],
 "output_modules": ["json"]
 },
 "segments": [
@@ -165,16 +165,14 @@ Final segment IDs are assigned after sorting and start at `1`.
 
 ## Autocorrect
 
-Autocorrect is an opt-in postprocessing module. It is not part of the default pipeline.
+Autocorrect is included in the default postprocessing pipeline. If `--autocorrect` is omitted, the module leaves transcript text unchanged and records a skip event in the optional report.
 
-Enable it by adding `autocorrect` to `--postprocessing-modules` and passing `--autocorrect`:
+Enable corrections by passing `--autocorrect`:
 
 ```sh
 go run ./cmd/seriatim merge \
   --input-file input.json \
-  --speakers speakers.yml \
   --autocorrect autocorrect.yml \
-  --postprocessing-modules detect-overlaps,resolve-overlaps,autocorrect,assign-ids,validate-output \
   --output-file merged.json
 ```
 
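Review note: the new default described in this hunk (the `autocorrect` module always runs, but leaves text unchanged and records a skip event when no rules file is passed) can be sketched as below. This is a hedged illustration rather than the project's implementation: the rules file format is not shown in this diff, so `Replacement`, `Event`, and `autocorrect` are hypothetical, and literal find-and-replace stands in for whatever rules the real module supports.

```go
package main

import (
	"fmt"
	"strings"
)

// Replacement is a hypothetical literal find-and-replace rule.
type Replacement struct {
	From string
	To   string
}

// Event is a hypothetical report entry.
type Event struct {
	Module string
	Kind   string
}

// autocorrect applies the rules to every text in order. A nil rules slice
// models an omitted --autocorrect flag: the texts are returned unchanged
// and a skip event is appended to the report events.
func autocorrect(texts []string, rules []Replacement, events []Event) ([]string, []Event) {
	if rules == nil {
		return texts, append(events, Event{Module: "autocorrect", Kind: "skipped"})
	}
	out := make([]string, len(texts))
	for i, t := range texts {
		for _, r := range rules {
			t = strings.ReplaceAll(t, r.From, r.To)
		}
		out[i] = t
	}
	return out, events
}

func main() {
	texts := []string{"we used whisper x for this"}

	unchanged, events := autocorrect(texts, nil, nil)
	fmt.Println(unchanged[0], events)

	rules := []Replacement{{From: "whisper x", To: "WhisperX"}}
	corrected, events := autocorrect(texts, rules, nil)
	fmt.Println(corrected[0], events)
}
```

Treating a missing rules file as a no-op is what lets the module stay in the default pipeline without making `--autocorrect` a required flag.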