Bugfixes and documentation cleanup for v1.0 release.

2026-05-01 11:30:29 -05:00
parent c9e98e14b5
commit f20f06db12
17 changed files with 332 additions and 177 deletions
--- a/README.md
+++ b/README.md
@@ -49,7 +49,7 @@ Global flags:
 | `--autocorrect` | No | none | Autocorrect rules YAML file. When omitted, the default `autocorrect` module leaves text unchanged. |
 | `--input-reader` | No | `json-files` | Input reader module. |
 | `--output-modules` | No | `json` | Comma-separated output modules. |
-| `--output-schema` | No | `default` | JSON output contract. Allowed values are `default`, `minimal`, and `seriatim`. |
+| `--output-schema` | No | `seriatim-intermediate` | JSON output contract. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. If omitted, the runtime default is used; consumers that depend on a specific shape should set this explicitly. |
 | `--preprocessing-modules` | No | `validate-raw,normalize-speakers,trim-text` | Comma-separated preprocessing modules, evaluated in order. |
 | `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
 | `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |
@@ -58,7 +58,8 @@ Environment variables:

 | Environment Variable | Default | Description |
 | --- | --- | --- |
-| `SERIATIM_OVERLAP_WORD_RUN_GAP` | `0.75` | Maximum gap in seconds between adjacent timed words when `resolve-overlaps` builds word-run replacement segments. Must be a positive float. |
+| `SERIATIM_OUTPUT_SCHEMA` | `seriatim-intermediate` | Output schema used when `--output-schema` is not explicitly provided. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. The CLI flag takes precedence. |
+| `SERIATIM_OVERLAP_WORD_RUN_GAP` | `1.0` | Maximum gap in seconds between adjacent timed words when `resolve-overlaps` builds word-run replacement segments. Must be a positive float. |
 | `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW` | `1.0` | Near-start window in seconds for ordering replacement word runs shortest-first. Must be a positive float. |
 | `SERIATIM_BACKCHANNEL_MAX_DURATION` | `2.0` | Maximum duration in seconds for `backchannel` classification. Must be a positive float. |
 | `SERIATIM_FILLER_MAX_DURATION` | `1.25` | Maximum duration in seconds for `filler` classification. Must be a positive float. |
@@ -159,14 +160,16 @@ The old `inputs:` direct mapping format is no longer supported.

 `--output-modules json` controls the writer. `--output-schema` controls the JSON contract that writer serializes.

-The `default` schema is the default output contract. It stays close to `minimal`, but adds optional `categories` on each segment:
+The named schemas are stable public contracts. If a consumer depends on a specific shape, it should request that schema explicitly at runtime. The runtime default selection may change in a future release.
+
+The `seriatim-intermediate` schema is the current default selection when neither `--output-schema` nor `SERIATIM_OUTPUT_SCHEMA` is set. It stays close to the minimal schema, but adds optional `categories` on each segment:

 ```json
 {
  "metadata": {
    "application": "seriatim",
    "version": "dev",
-    "output_schema": "default"
+    "output_schema": "seriatim-intermediate"
  },
  "segments": [
    {
@@ -181,7 +184,7 @@ The `default` schema is the default output contract. It stays close to `minimal`
 }
 ```

-The explicit `seriatim` schema uses the full seriatim envelope:
+The `seriatim-full` schema uses the full seriatim envelope:

 ```json
 {
@@ -191,7 +194,7 @@ The explicit `seriatim` schema uses the full seriatim envelope:
    "input_reader": "json-files",
    "input_files": ["eric.json", "mike.json"],
    "preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
-    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "coalesce", "resolve-danglers", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
+    "postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "resolve-danglers", "coalesce", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
    "output_modules": ["json"]
  },
  "segments": [
@@ -231,14 +234,14 @@ The explicit `seriatim` schema uses the full seriatim envelope:
 }
 ```

-The `minimal` schema emits minimal metadata and compact ordered segments:
+The `seriatim-minimal` schema emits minimal metadata and compact ordered segments:

 ```json
 {
  "metadata": {
    "application": "seriatim",
    "version": "dev",
-    "output_schema": "minimal"
+    "output_schema": "seriatim-minimal"
  },
  "segments": [
    {
@@ -254,7 +257,7 @@ The `minimal` schema emits minimal metadata and compact ordered segments:

 Minimal output intentionally omits categories, overlap groups, source/provenance fields, and pipeline configuration metadata.

-Default output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.
+Intermediate output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.

 Segments are sorted deterministically by:

@@ -270,7 +273,7 @@ The public Go output contract is available from:
 import "gitea.maximumdirect.net/eric/seriatim/schema"
 ```

-The same package embeds machine-readable JSON Schemas in `schema/output.schema.json`, `schema/default-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
+The same package embeds machine-readable JSON Schemas in `schema/full-output.schema.json`, `schema/intermediate-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.

 ## Overlap Detection

@@ -295,12 +298,12 @@ For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word

 - The resolution window expands the detected overlap group by `--coalesce-gap` seconds on both sides.
 - Nearby same-speaker context segments are included when they intersect the expanded window and their start or end is within `--coalesce-gap` of the original overlap boundary.
- Words are included when their interval intersects the expanded resolution window.
+- Once a segment is selected for replacement, all timed words from that segment participate in word-run construction; the window controls segment selection, not per-word clipping.
 - Context segments that are part of another detected overlap group are not pulled into the current group.
 - Untimed words are included in replacement text in original word order when nearby timed words create a replacement run.
 - Untimed words do not affect replacement segment start/end times or word-run gap splitting.
 - Words for the same speaker are merged into one run when the gap between adjacent words is no greater than `SERIATIM_OVERLAP_WORD_RUN_GAP`.
- The default word-run gap is `0.75` seconds.
+- The default word-run gap is `1.0` seconds.
 - Set `SERIATIM_OVERLAP_WORD_RUN_GAP` to a positive number of seconds to override the default.
 - Near-start replacement word runs are reordered so shorter segments come first when adjacent starts are within `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW`.
 - The default word-run reorder window is `1.0` seconds.
@@ -339,12 +342,12 @@ The default pipeline runs `resolve-danglers` before `coalesce` and before the se

 - Dangling-end fragments have no more than two words and end in punctuation.
 - Dangling-start fragments have no more than two words.
- Matching uses any shared `derived_from` value.
+- Matching uses same-speaker segments with any shared `derived_from` value.
 - Merged segments use `source_ref` values such as `resolve-danglers:1`, keep the target segment's transcript position, and union `derived_from`.

 ## Coalescing

-The default pipeline runs `coalesce` after `resolve-danglers` and the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.
+The default pipeline runs `coalesce` after `resolve-danglers` and before the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.

 Coalesced segments use `source_ref` values such as `coalesce:1`, include `derived_from`, and omit `source_segment_index`.