diff --git a/.gitignore b/.gitignore
index b0ec19c..ab324e4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -27,6 +27,7 @@ go.work.sum
 
 # Binaries for this application
 cmd/seriatim/seriatim
+seriatim
 
 # Sample transcripts for testing
-samples/
\ No newline at end of file
+samples/
diff --git a/README.md b/README.md
index 8b37cff..118a6f9 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,137 @@
 # seriatim
 
-Seriatim merges per-speaker whisperx transcripts into a single output transcript that preserves speaker identity and chronological order.
\ No newline at end of file
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.
+
+The current implementation supports the `merge` command. It reads one or more input JSON files, maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, assigns consecutive numeric `id` values, and writes a merged JSON artifact.
+
+## Usage
+
+Run from source:
+
+```sh
+go run ./cmd/seriatim merge \
+  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
+  --input-file samples/raw/2026-04-19-Mike_Brown.json \
+  --speakers samples/speakers.yml \
+  --output-file merged.json
+```
+
+Optional report output:
+
+```sh
+go run ./cmd/seriatim merge \
+  --input-file eric.json \
+  --input-file mike.json \
+  --speakers speakers.yml \
+  --output-file merged.json \
+  --report-file report.json
+```
+
+## CLI
+
+```text
+seriatim merge [flags]
+```
+
+Required flags for the default pipeline:
+
+- `--input-file`: input transcript JSON file. Repeat once per speaker/input file.
+- `--speakers`: speaker map YAML file. Required because `normalize-speakers` is enabled by default.
+- `--output-file`: merged transcript JSON output path.
+
+Optional flags:
+
+- `--report-file`: write a JSON report with pipeline events.
+- `--input-reader`: input reader module. Default: `json-files`.
+- `--output-modules`: comma-separated output modules. Default: `json`.
+- `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
+- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,assign-ids,validate-output`.
+- `--autocorrect`: autocorrect rules file. Reserved for the `autocorrect` module; not part of the default pipeline.
+
+## Input JSON Format
+
+Each input file must be valid JSON with a top-level `segments` array. The current parser accepts the WhisperX segment subset needed for merging:
+
+```json
+{
+  "segments": [
+    {
+      "start": 1.25,
+      "end": 3.5,
+      "text": "Hello there."
+    }
+  ]
+}
+```
+
+Required segment fields:
+
+- `start`: number, must be `>= 0`.
+- `end`: number, must be `>= start`.
+- `text`: string.
+
+Other WhisperX fields, including `words` and raw diarization speaker labels, are ignored for now.
+
+## Speaker Map Format
+
+`speakers.yml` maps each input file basename to one canonical speaker name:
+
+```yaml
+inputs:
+  2026-04-19-Eric_Rakestraw.json:
+    speaker: "Eric Rakestraw"
+
+  2026-04-19-Mike_Brown.json:
+    speaker: "Mike Brown"
+```
+
+Important details:
+
+- Keys are matched against the basename of each `--input-file`, not the full path.
+- Every input file must have exactly one matching entry.
+- `speaker` is required and must be non-empty.
+
+## Output JSON Format
+
+The merged output uses the current seriatim envelope:
+
+```json
+{
+  "metadata": {
+    "application": "seriatim",
+    "version": "dev",
+    "input_reader": "json-files",
+    "input_files": ["eric.json", "mike.json"],
+    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
+    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "assign-ids", "validate-output"],
+    "output_modules": ["json"]
+  },
+  "segments": [
+    {
+      "id": 1,
+      "source": "eric.json",
+      "source_segment_index": 0,
+      "speaker": "Eric Rakestraw",
+      "start": 1.25,
+      "end": 3.5,
+      "text": "Hello there."
+    }
+  ],
+  "overlap_groups": []
+}
+```
+
+Segments are sorted deterministically by:
+
+```text
+(start, end, source, source_segment_index, speaker)
+```
+
+Final segment IDs are assigned after sorting and start at `1`.
+
+## Current Limitations
+
+- Only JSON input is supported.
+- Word-level timing data is not preserved yet.
+- Overlap detection and overlap resolution are currently no-op modules.
+- Autocorrect, coalescing, and alternate output formats are not implemented yet.
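
Note on the README's described behavior: the rules it documents (basename-keyed speaker lookup, `start >= 0` and `end >= start`, the five-key sort, IDs starting at `1`) can be illustrated with a minimal Go sketch. This is not taken from the seriatim source. The `Segment` type and the helpers `speakerFor`, `validate`, and `sortAndAssignIDs` are hypothetical names chosen for illustration, and the speaker map is assumed to be already loaded from `speakers.yml` into a plain `map[string]string`.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// Segment mirrors the output fields shown in the README.
// The Go type itself is hypothetical, not seriatim's actual type.
type Segment struct {
	ID                 int
	Source             string
	SourceSegmentIndex int
	Speaker            string
	Start, End         float64
	Text               string
}

// speakerFor looks up the canonical speaker by the basename of the
// input path, as the README says keys are matched (assumed helper).
func speakerFor(speakers map[string]string, inputPath string) (string, error) {
	base := filepath.Base(inputPath)
	name, ok := speakers[base]
	if !ok || name == "" {
		return "", fmt.Errorf("no non-empty speaker entry for %q", base)
	}
	return name, nil
}

// validate applies the README's segment rules: start >= 0, end >= start.
func validate(s Segment) error {
	if s.Start < 0 || s.End < s.Start {
		return fmt.Errorf("%s[%d]: invalid times start=%v end=%v",
			s.Source, s.SourceSegmentIndex, s.Start, s.End)
	}
	return nil
}

// sortAndAssignIDs orders segments by
// (start, end, source, source_segment_index, speaker)
// and then numbers them consecutively from 1.
func sortAndAssignIDs(segs []Segment) {
	sort.SliceStable(segs, func(i, j int) bool {
		a, b := segs[i], segs[j]
		if a.Start != b.Start {
			return a.Start < b.Start
		}
		if a.End != b.End {
			return a.End < b.End
		}
		if a.Source != b.Source {
			return a.Source < b.Source
		}
		if a.SourceSegmentIndex != b.SourceSegmentIndex {
			return a.SourceSegmentIndex < b.SourceSegmentIndex
		}
		return a.Speaker < b.Speaker
	})
	for i := range segs {
		segs[i].ID = i + 1
	}
}

func main() {
	speakers := map[string]string{"eric.json": "Eric Rakestraw", "mike.json": "Mike Brown"}
	inputs := map[string][]Segment{
		"samples/raw/mike.json": {{Start: 2.0, End: 4.0, Text: "Hi, Eric."}},
		"samples/raw/eric.json": {{Start: 1.25, End: 3.5, Text: "Hello there."}},
	}

	var merged []Segment
	for path, segs := range inputs {
		name, err := speakerFor(speakers, path)
		if err != nil {
			panic(err)
		}
		for i, s := range segs {
			s.Source, s.SourceSegmentIndex, s.Speaker = filepath.Base(path), i, name
			if err := validate(s); err != nil {
				panic(err)
			}
			merged = append(merged, s)
		}
	}

	sortAndAssignIDs(merged)
	for _, s := range merged {
		fmt.Printf("%d %s [%.2f-%.2f] %s\n", s.ID, s.Speaker, s.Start, s.End, s.Text)
	}
}
```

Because every tuple field takes part in the comparison, the result should not depend on the order in which input files are read, which appears to be the intent of the README's "sorted deterministically" wording.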