Updated documentation to reflect the current CLI interface
3
.gitignore
vendored
@@ -27,6 +27,7 @@ go.work.sum
# Binaries for this application
cmd/seriatim/seriatim
seriatim

# Sample transcripts for testing
samples/
samples/

136
README.md
@@ -1,3 +1,137 @@
# seriatim

Seriatim merges per-speaker whisperx transcripts into a single output transcript that preserves speaker identity and chronological order.
`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.

The current implementation supports the `merge` command. It reads one or more input JSON files, maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, assigns consecutive numeric `id` values, and writes a merged JSON artifact.

## Usage

Run from source:

```sh
go run ./cmd/seriatim merge \
  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
  --input-file samples/raw/2026-04-19-Mike_Brown.json \
  --speakers samples/speakers.yml \
  --output-file merged.json
```

Optional report output:

```sh
go run ./cmd/seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --speakers speakers.yml \
  --output-file merged.json \
  --report-file report.json
```

## CLI

```text
seriatim merge [flags]
```

Required flags for the default pipeline:

- `--input-file`: input transcript JSON file. Repeat once per speaker/input file.
- `--speakers`: speaker map YAML file. Required because `normalize-speakers` is enabled by default.
- `--output-file`: merged transcript JSON output path.

Optional flags:

- `--report-file`: write a JSON report with pipeline events.
- `--input-reader`: input reader module. Default: `json-files`.
- `--output-modules`: comma-separated output modules. Default: `json`.
- `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,assign-ids,validate-output`.
- `--autocorrect`: autocorrect rules file. Reserved for the `autocorrect` module; not part of the default pipeline.

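
For readers curious how a repeatable flag and a comma-separated module list can be handled, here is a minimal sketch using only Go's standard `flag` package. It is an illustration, not seriatim's actual flag handling: the names `stringList`, `splitModules`, and `parseMergeFlags` are hypothetical, and the real CLI may use a different flag library.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stringList collects every occurrence of a repeatable flag (hypothetical type).
type stringList []string

func (s *stringList) String() string { return strings.Join(*s, ",") }

// Set is called once per occurrence of the flag on the command line.
func (s *stringList) Set(v string) error {
	*s = append(*s, v)
	return nil
}

// splitModules turns "a,b,c" into a slice, trimming spaces and dropping empty names.
func splitModules(csv string) []string {
	var out []string
	for _, name := range strings.Split(csv, ",") {
		if name = strings.TrimSpace(name); name != "" {
			out = append(out, name)
		}
	}
	return out
}

// parseMergeFlags is a hypothetical helper covering two of the documented flags.
func parseMergeFlags(args []string) (inputs []string, pre []string, err error) {
	fs := flag.NewFlagSet("merge", flag.ContinueOnError)
	var files stringList
	fs.Var(&files, "input-file", "input transcript JSON file (repeatable)")
	preCSV := fs.String("preprocessing-modules",
		"validate-raw,normalize-speakers,trim-text",
		"comma-separated preprocessing modules")
	if err := fs.Parse(args); err != nil {
		return nil, nil, err
	}
	if len(files) == 0 {
		return nil, nil, fmt.Errorf("at least one --input-file is required")
	}
	return files, splitModules(*preCSV), nil
}

func main() {
	inputs, pre, err := parseMergeFlags([]string{
		"--input-file", "eric.json",
		"--input-file", "mike.json",
	})
	fmt.Println(inputs, pre, err)
}
```

Implementing `flag.Value` is the standard-library way to let a flag appear more than once; each occurrence appends to the slice in command-line order.
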
## Input JSON Format

Each input file must be valid JSON with a top-level `segments` array. The current parser accepts the WhisperX segment subset needed for merging:

```json
{
  "segments": [
    {
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there."
    }
  ]
}
```

Required segment fields:

- `start`: number, must be `>= 0`.
- `end`: number, must be `>= start`.
- `text`: string.

Other WhisperX fields, including `words` and raw diarization speaker labels, are ignored for now.

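
The field rules above can be sketched in Go with `encoding/json`. This is an assumption-laden illustration rather than the project's parser: `rawSegment`, `rawTranscript`, and `parseTranscript` are hypothetical names, and pointer fields are used only so that a missing field can be told apart from a zero value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawSegment holds the three required fields; pointers reveal missing keys.
type rawSegment struct {
	Start *float64 `json:"start"`
	End   *float64 `json:"end"`
	Text  *string  `json:"text"`
}

// rawTranscript is the top-level input document (hypothetical type).
type rawTranscript struct {
	Segments []rawSegment `json:"segments"`
}

// parseTranscript decodes one input file and applies the documented checks.
// Unknown fields such as "words" are skipped by encoding/json by default.
func parseTranscript(data []byte) (*rawTranscript, error) {
	var t rawTranscript
	if err := json.Unmarshal(data, &t); err != nil {
		return nil, fmt.Errorf("invalid JSON: %w", err)
	}
	if t.Segments == nil {
		return nil, fmt.Errorf("missing top-level segments array")
	}
	for i, s := range t.Segments {
		switch {
		case s.Start == nil || s.End == nil || s.Text == nil:
			return nil, fmt.Errorf("segment %d: start, end, and text are required", i)
		case *s.Start < 0:
			return nil, fmt.Errorf("segment %d: start must be >= 0", i)
		case *s.End < *s.Start:
			return nil, fmt.Errorf("segment %d: end must be >= start", i)
		}
	}
	return &t, nil
}

func main() {
	data := []byte(`{"segments":[{"start":1.25,"end":3.5,"text":"Hello there.","words":[]}]}`)
	t, err := parseTranscript(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(t.Segments), "segment(s) accepted")
}
```
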
## Speaker Map Format

`speakers.yml` maps each input file basename to one canonical speaker name:

```yaml
inputs:
  2026-04-19-Eric_Rakestraw.json:
    speaker: "Eric Rakestraw"

  2026-04-19-Mike_Brown.json:
    speaker: "Mike Brown"
```

Important details:

- Keys are matched against the basename of each `--input-file`, not the full path.
- Every input file must have exactly one matching entry.
- `speaker` is required and must be non-empty.

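
The basename matching described above might look like the following sketch. It assumes the YAML has already been decoded into a map (YAML decoding is left out because it needs a third-party package), and `speakerEntry` and `resolveSpeaker` are hypothetical names, not seriatim's API.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// speakerEntry mirrors one value under "inputs" in speakers.yml (hypothetical type).
type speakerEntry struct {
	Speaker string
}

// resolveSpeaker looks up an input path by its basename and enforces
// the rule that speaker must be present and non-empty.
func resolveSpeaker(inputs map[string]speakerEntry, inputPath string) (string, error) {
	base := filepath.Base(inputPath)
	entry, ok := inputs[base]
	if !ok {
		return "", fmt.Errorf("no speaker map entry for %q", base)
	}
	if strings.TrimSpace(entry.Speaker) == "" {
		return "", fmt.Errorf("speaker for %q must be non-empty", base)
	}
	return entry.Speaker, nil
}

func main() {
	inputs := map[string]speakerEntry{
		"2026-04-19-Eric_Rakestraw.json": {Speaker: "Eric Rakestraw"},
		"2026-04-19-Mike_Brown.json":     {Speaker: "Mike Brown"},
	}
	name, err := resolveSpeaker(inputs, "samples/raw/2026-04-19-Eric_Rakestraw.json")
	fmt.Println(name, err)
}
```

Because the lookup key is the basename, two input files with the same name in different directories would collide under this scheme.
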
## Output JSON Format

The merged output uses the current seriatim envelope:

```json
{
  "metadata": {
    "application": "seriatim",
    "version": "dev",
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
    {
      "id": 1,
      "source": "eric.json",
      "source_segment_index": 0,
      "speaker": "Eric Rakestraw",
      "start": 1.25,
      "end": 3.5,
      "text": "Hello there."
    }
  ],
  "overlap_groups": []
}
```

Segments are sorted deterministically by:

```text
(start, end, source, source_segment_index, speaker)
```

Final segment IDs are assigned after sorting and start at `1`.

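
As an illustration of the sort key and ID assignment, the sketch below compares segments field by field in the documented order and then numbers them from `1`. The `segment` type, `less`, and `sortAndNumber` are hypothetical names chosen for this example; the actual implementation may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// segment carries the output fields that take part in ordering (hypothetical type).
type segment struct {
	ID                 int
	Source             string
	SourceSegmentIndex int
	Speaker            string
	Start, End         float64
	Text               string
}

// less compares by (start, end, source, source_segment_index, speaker).
func less(a, b segment) bool {
	switch {
	case a.Start != b.Start:
		return a.Start < b.Start
	case a.End != b.End:
		return a.End < b.End
	case a.Source != b.Source:
		return a.Source < b.Source
	case a.SourceSegmentIndex != b.SourceSegmentIndex:
		return a.SourceSegmentIndex < b.SourceSegmentIndex
	default:
		return a.Speaker < b.Speaker
	}
}

// sortAndNumber sorts in place, then assigns consecutive IDs starting at 1.
func sortAndNumber(segs []segment) {
	sort.SliceStable(segs, func(i, j int) bool { return less(segs[i], segs[j]) })
	for i := range segs {
		segs[i].ID = i + 1
	}
}

func main() {
	segs := []segment{
		{Source: "mike.json", SourceSegmentIndex: 0, Speaker: "Mike Brown", Start: 1.25, End: 3.5, Text: "Hi."},
		{Source: "eric.json", SourceSegmentIndex: 1, Speaker: "Eric Rakestraw", Start: 4.0, End: 5.0, Text: "Right."},
		{Source: "eric.json", SourceSegmentIndex: 0, Speaker: "Eric Rakestraw", Start: 1.25, End: 3.5, Text: "Hello there."},
	}
	sortAndNumber(segs)
	for _, s := range segs {
		fmt.Println(s.ID, s.Source, s.Start, s.Text)
	}
}
```

Using the full tuple as the key means two segments with identical timestamps still land in a fixed order, so repeated runs over the same inputs produce the same IDs.
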
## Current Limitations

- Only JSON input is supported.
- Word-level timing data is not preserved yet.
- Overlap detection and overlap resolution are currently no-op modules.
- Autocorrect, coalescing, and alternate output formats are not implemented yet.