Updated documentation to reflect the current CLI

2026-04-26 18:43:43 -05:00
parent 18f1873776
commit fe00600762
2 changed files with 137 additions and 2 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -27,6 +27,7 @@ go.work.sum

 # Binaries for this application
 cmd/seriatim/seriatim
+seriatim

 # Sample transcripts for testing
-samples/
+samples/
--- a/README.md
+++ b/README.md
@@ -1,3 +1,137 @@
 # seriatim

-Seriatim merges per-speaker whisperx transcripts into a single output transcript that preserves speaker identity and chronological order.
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order.
+
+The current implementation supports the `merge` command. It reads one or more input JSON files, maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by start time with deterministic tie-breaking, assigns consecutive numeric `id` values, and writes a merged JSON artifact.
+
+## Usage
+
+Run from source:
+
+```sh
+go run ./cmd/seriatim merge \
+  --input-file samples/raw/2026-04-19-Eric_Rakestraw.json \
+  --input-file samples/raw/2026-04-19-Mike_Brown.json \
+  --speakers samples/speakers.yml \
+  --output-file merged.json
+```
+
+Optional report output:
+
+```sh
+go run ./cmd/seriatim merge \
+  --input-file eric.json \
+  --input-file mike.json \
+  --speakers speakers.yml \
+  --output-file merged.json \
+  --report-file report.json
+```
+
+## CLI
+
+```text
+seriatim merge [flags]
+```
+
+Required flags for the default pipeline:
+
+- `--input-file`: input transcript JSON file. Repeat once per speaker/input file.
+- `--speakers`: speaker map YAML file. Required because `normalize-speakers` is enabled by default.
+- `--output-file`: merged transcript JSON output path.
+
+Optional flags:
+
+- `--report-file`: write a JSON report with pipeline events.
+- `--input-reader`: input reader module. Default: `json-files`.
+- `--output-modules`: comma-separated output modules. Default: `json`.
+- `--preprocessing-modules`: comma-separated preprocessing modules. Default: `validate-raw,normalize-speakers,trim-text`.
+- `--postprocessing-modules`: comma-separated postprocessing modules. Default: `detect-overlaps,resolve-overlaps,assign-ids,validate-output`.
+- `--autocorrect`: path to an autocorrect rules file. Reserved for the `autocorrect` module, which is not implemented yet and is not part of the default pipeline.
+
+## Input JSON Format
+
+Each input file must be valid JSON with a top-level `segments` array. The current parser accepts the WhisperX segment subset needed for merging:
+
+```json
+{
+  "segments": [
+    {
+      "start": 1.25,
+      "end": 3.5,
+      "text": "Hello there."
+    }
+  ]
+}
+```
+
+Required segment fields:
+
+- `start`: number, must be `>= 0`.
+- `end`: number, must be `>= start`.
+- `text`: string.
+
+Other WhisperX fields, including `words` and raw diarization speaker labels, are ignored for now.
+
+## Speaker Map Format
+
+`speakers.yml` maps each input file basename to one canonical speaker name:
+
+```yaml
+inputs:
+  2026-04-19-Eric_Rakestraw.json:
+    speaker: "Eric Rakestraw"
+
+  2026-04-19-Mike_Brown.json:
+    speaker: "Mike Brown"
+```
+
+Important details:
+
+- Keys are matched against the basename of each `--input-file`, not the full path.
+- Every input file must have exactly one matching entry.
+- `speaker` is required and must be non-empty.
+
+## Output JSON Format
+
+The merged output uses the current seriatim envelope:
+
+```json
+{
+  "metadata": {
+    "application": "seriatim",
+    "version": "dev",
+    "input_reader": "json-files",
+    "input_files": ["eric.json", "mike.json"],
+    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
+    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "assign-ids", "validate-output"],
+    "output_modules": ["json"]
+  },
+  "segments": [
+    {
+      "id": 1,
+      "source": "eric.json",
+      "source_segment_index": 0,
+      "speaker": "Eric Rakestraw",
+      "start": 1.25,
+      "end": 3.5,
+      "text": "Hello there."
+    }
+  ],
+  "overlap_groups": []
+}
+```
+
+Segments are sorted deterministically by:
+
+```text
+(start, end, source, source_segment_index, speaker)
+```
+
+Final segment IDs are assigned after sorting and start at `1`.
+
+## Current Limitations
+
+- Only JSON input is supported.
+- Word-level timing data is not preserved yet.
+- Overlap detection and overlap resolution are currently no-op modules, even though both run in the default postprocessing pipeline.
+- Autocorrect, coalescing, and alternate output formats are not implemented yet.
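
The merge behavior that the new README describes (check each segment against the `start`/`end` rules, sort by the documented tuple, then number from `1`) can be sketched in Go roughly as follows. This is an illustration only, not code from this commit: the `Segment` type and the `validate`, `less`, and `mergeSegments` functions are hypothetical names, and the actual modules in seriatim may be structured differently.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// Segment mirrors the output segment fields shown in the README.
// The Go type and its field names are assumptions made for this sketch.
type Segment struct {
	ID                 int
	Source             string
	SourceSegmentIndex int
	Speaker            string
	Start, End         float64
	Text               string
}

// validate applies the documented input rules: start >= 0 and end >= start.
func validate(s Segment) error {
	if s.Start < 0 {
		return fmt.Errorf("%s[%d]: start %v is negative", s.Source, s.SourceSegmentIndex, s.Start)
	}
	if s.End < s.Start {
		return fmt.Errorf("%s[%d]: end %v is before start %v", s.Source, s.SourceSegmentIndex, s.End, s.Start)
	}
	return nil
}

// less orders segments by (start, end, source, source_segment_index, speaker).
func less(a, b Segment) bool {
	switch {
	case a.Start != b.Start:
		return a.Start < b.Start
	case a.End != b.End:
		return a.End < b.End
	case a.Source != b.Source:
		return a.Source < b.Source
	case a.SourceSegmentIndex != b.SourceSegmentIndex:
		return a.SourceSegmentIndex < b.SourceSegmentIndex
	default:
		return a.Speaker < b.Speaker
	}
}

// mergeSegments validates every input segment, sorts the combined list,
// and assigns consecutive IDs starting at 1 after sorting.
func mergeSegments(inputs ...[]Segment) ([]Segment, error) {
	var merged []Segment
	for _, in := range inputs {
		for _, s := range in {
			if err := validate(s); err != nil {
				return nil, err
			}
			merged = append(merged, s)
		}
	}
	sort.Slice(merged, func(i, j int) bool { return less(merged[i], merged[j]) })
	for i := range merged {
		merged[i].ID = i + 1
	}
	return merged, nil
}

func main() {
	eric := []Segment{
		{Source: "eric.json", SourceSegmentIndex: 0, Speaker: "Eric Rakestraw", Start: 1.25, End: 3.5, Text: "Hello there."},
		{Source: "eric.json", SourceSegmentIndex: 1, Speaker: "Eric Rakestraw", Start: 6.0, End: 7.5, Text: "Sounds good."},
	}
	mike := []Segment{
		{Source: "mike.json", SourceSegmentIndex: 0, Speaker: "Mike Brown", Start: 3.0, End: 5.0, Text: "Hi, Eric."},
	}
	merged, err := mergeSegments(eric, mike)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, s := range merged {
		fmt.Printf("%d %.2f-%.2f %s: %s\n", s.ID, s.Start, s.End, s.Speaker, s.Text)
	}
}
```

Because the comparison falls through to `source`, `source_segment_index`, and `speaker` when timestamps tie, two segments from different files with identical `start` and `end` values always land in the same order, which is what makes the assigned IDs reproducible across runs.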