Bugfixes and documentation cleanup for v1.0 release.
All checks were successful
ci/woodpecker/tag/release Pipeline was successful

This commit is contained in:
2026-05-01 11:30:29 -05:00
parent c9e98e14b5
commit f20f06db12
17 changed files with 332 additions and 177 deletions

View File

@@ -49,7 +49,7 @@ Global flags:
| `--autocorrect` | No | none | Autocorrect rules YAML file. When omitted, the default `autocorrect` module leaves text unchanged. |
| `--input-reader` | No | `json-files` | Input reader module. |
| `--output-modules` | No | `json` | Comma-separated output modules. |
| `--output-schema` | No | `default` | JSON output contract. Allowed values are `default`, `minimal`, and `seriatim`. |
| `--output-schema` | No | `seriatim-intermediate` | JSON output contract. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. If omitted, the runtime default is used; consumers that depend on a specific shape should set this explicitly. |
| `--preprocessing-modules` | No | `validate-raw,normalize-speakers,trim-text` | Comma-separated preprocessing modules, evaluated in order. |
| `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
| `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |
@@ -58,7 +58,8 @@ Environment variables:
| Environment Variable | Default | Description |
| --- | --- | --- |
| `SERIATIM_OVERLAP_WORD_RUN_GAP` | `0.75` | Maximum gap in seconds between adjacent timed words when `resolve-overlaps` builds word-run replacement segments. Must be a positive float. |
| `SERIATIM_OUTPUT_SCHEMA` | `seriatim-intermediate` | Output schema used when `--output-schema` is not explicitly provided. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. The CLI flag takes precedence. |
| `SERIATIM_OVERLAP_WORD_RUN_GAP` | `1.0` | Maximum gap in seconds between adjacent timed words when `resolve-overlaps` builds word-run replacement segments. Must be a positive float. |
| `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW` | `1.0` | Near-start window in seconds for ordering replacement word runs shortest-first. Must be a positive float. |
| `SERIATIM_BACKCHANNEL_MAX_DURATION` | `2.0` | Maximum duration in seconds for `backchannel` classification. Must be a positive float. |
| `SERIATIM_FILLER_MAX_DURATION` | `1.25` | Maximum duration in seconds for `filler` classification. Must be a positive float. |
@@ -159,14 +160,16 @@ The old `inputs:` direct mapping format is no longer supported.
`--output-modules json` controls the writer. `--output-schema` controls the JSON contract that writer serializes.
The `default` schema is the default output contract. It stays close to `minimal`, but adds optional `categories` on each segment:
The named schemas are stable public contracts. If a consumer depends on a specific shape, it should request that schema explicitly at runtime. The runtime default selection may change in a future release.
The `seriatim-intermediate` schema is the current default selection when neither `--output-schema` nor `SERIATIM_OUTPUT_SCHEMA` is set. It stays close to the minimal schema, but adds optional `categories` on each segment:
```json
{
"metadata": {
"application": "seriatim",
"version": "dev",
"output_schema": "default"
"output_schema": "seriatim-intermediate"
},
"segments": [
{
@@ -181,7 +184,7 @@ The `default` schema is the default output contract. It stays close to `minimal`
}
```
The explicit `seriatim` schema uses the full seriatim envelope:
The `seriatim-full` schema uses the full seriatim envelope:
```json
{
@@ -191,7 +194,7 @@ The explicit `seriatim` schema uses the full seriatim envelope:
"input_reader": "json-files",
"input_files": ["eric.json", "mike.json"],
"preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "coalesce", "resolve-danglers", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "resolve-danglers", "coalesce", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
"output_modules": ["json"]
},
"segments": [
@@ -231,14 +234,14 @@ The explicit `seriatim` schema uses the full seriatim envelope:
}
```
The `minimal` schema emits minimal metadata and compact ordered segments:
The `seriatim-minimal` schema emits minimal metadata and compact ordered segments:
```json
{
"metadata": {
"application": "seriatim",
"version": "dev",
"output_schema": "minimal"
"output_schema": "seriatim-minimal"
},
"segments": [
{
@@ -254,7 +257,7 @@ The `minimal` schema emits minimal metadata and compact ordered segments:
Minimal output intentionally omits categories, overlap groups, source/provenance fields, and pipeline configuration metadata.
Default output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.
Intermediate output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.
Segments are sorted deterministically by:
@@ -270,7 +273,7 @@ The public Go output contract is available from:
import "gitea.maximumdirect.net/eric/seriatim/schema"
```
The same package embeds machine-readable JSON Schemas in `schema/output.schema.json`, `schema/default-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
The same package embeds machine-readable JSON Schemas in `schema/full-output.schema.json`, `schema/intermediate-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
## Overlap Detection
@@ -295,12 +298,12 @@ For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word
- The resolution window expands the detected overlap group by `--coalesce-gap` seconds on both sides.
- Nearby same-speaker context segments are included when they intersect the expanded window and their start or end is within `--coalesce-gap` of the original overlap boundary.
- Words are included when their interval intersects the expanded resolution window.
- Once a segment is selected for replacement, all timed words from that segment participate in word-run construction; the window controls segment selection, not per-word clipping.
- Context segments that are part of another detected overlap group are not pulled into the current group.
- Untimed words are included in replacement text in original word order when nearby timed words create a replacement run.
- Untimed words do not affect replacement segment start/end times or word-run gap splitting.
- Words for the same speaker are merged into one run when the gap between adjacent words is no greater than `SERIATIM_OVERLAP_WORD_RUN_GAP`.
- The default word-run gap is `0.75` seconds.
- The default word-run gap is `1.0` seconds.
- Set `SERIATIM_OVERLAP_WORD_RUN_GAP` to a positive number of seconds to override the default.
- Near-start replacement word runs are reordered so shorter segments come first when adjacent starts are within `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW`.
- The default word-run reorder window is `1.0` seconds.
@@ -339,12 +342,12 @@ The default pipeline runs `resolve-danglers` before `coalesce` and before the se
- Dangling-end fragments have no more than two words and end in punctuation.
- Dangling-start fragments have no more than two words.
- Matching uses any shared `derived_from` value.
- Matching uses same-speaker segments with any shared `derived_from` value.
- Merged segments use `source_ref` values such as `resolve-danglers:1`, keep the target segment's transcript position, and union `derived_from`.
## Coalescing
The default pipeline runs `coalesce` after `resolve-danglers` and the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.
The default pipeline runs `coalesce` after `resolve-danglers` and before the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.
Coalesced segments use `source_ref` values such as `coalesce:1`, include `derived_from`, and omit `source_segment_index`.