Bugfixes and documentation cleanup for v1.0 release.
All checks were successful
ci/woodpecker/tag/release Pipeline was successful
31
README.md
@@ -49,7 +49,7 @@ Global flags:
| `--autocorrect` | No | none | Autocorrect rules YAML file. When omitted, the default `autocorrect` module leaves text unchanged. |
| `--input-reader` | No | `json-files` | Input reader module. |
| `--output-modules` | No | `json` | Comma-separated output modules. |
| `--output-schema` | No | `default` | JSON output contract. Allowed values are `default`, `minimal`, and `seriatim`. |
| `--output-schema` | No | `seriatim-intermediate` | JSON output contract. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. If omitted, the runtime default is used; consumers that depend on a specific shape should set this explicitly. |
| `--preprocessing-modules` | No | `validate-raw,normalize-speakers,trim-text` | Comma-separated preprocessing modules, evaluated in order. |
| `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
| `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |
@@ -58,7 +58,8 @@ Environment variables:
| Environment Variable | Default | Description |
| --- | --- | --- |
| `SERIATIM_OVERLAP_WORD_RUN_GAP` | `0.75` | Maximum gap in seconds between adjacent timed words when `resolve-overlaps` builds word-run replacement segments. Must be a positive float. |
| `SERIATIM_OUTPUT_SCHEMA` | `seriatim-intermediate` | Output schema used when `--output-schema` is not explicitly provided. Allowed values are `seriatim-minimal`, `seriatim-intermediate`, and `seriatim-full`. The CLI flag takes precedence. |
| `SERIATIM_OVERLAP_WORD_RUN_GAP` | `1.0` | Maximum gap in seconds between adjacent timed words when `resolve-overlaps` builds word-run replacement segments. Must be a positive float. |
| `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW` | `1.0` | Near-start window in seconds for ordering replacement word runs shortest-first. Must be a positive float. |
| `SERIATIM_BACKCHANNEL_MAX_DURATION` | `2.0` | Maximum duration in seconds for `backchannel` classification. Must be a positive float. |
| `SERIATIM_FILLER_MAX_DURATION` | `1.25` | Maximum duration in seconds for `filler` classification. Must be a positive float. |
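
Several of the variables above are documented as "Must be a positive float". As a rough illustration of what that constraint implies, here is a minimal Go sketch. The helper name `positiveFloat`, its error wording, and the choice to treat an empty value as "unset" are assumptions made for this example, not seriatim's actual code.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// positiveFloat parses raw as a positive, finite float. An empty string is
// treated as "variable not set" (an assumption), so the fallback is returned.
// The function name and error wording are hypothetical.
func positiveFloat(name, raw string, fallback float64) (float64, error) {
	if raw == "" {
		return fallback, nil
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("%s: %q is not a number: %w", name, raw, err)
	}
	// NaN fails the v > 0 comparison; infinity is rejected explicitly.
	if !(v > 0) || math.IsInf(v, 0) {
		return 0, fmt.Errorf("%s: %q must be a positive float", name, raw)
	}
	return v, nil
}

func main() {
	// In a real program raw would typically come from os.Getenv.
	gap, err := positiveFloat("SERIATIM_OVERLAP_WORD_RUN_GAP", "", 1.0)
	fmt.Println(gap, err)

	_, err = positiveFloat("SERIATIM_FILLER_MAX_DURATION", "-2", 1.25)
	fmt.Println(err)
}
```

Passing the raw string in, rather than reading the environment inside the helper, keeps the validation easy to test.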
@@ -159,14 +160,16 @@ The old `inputs:` direct mapping format is no longer supported.
`--output-modules json` controls the writer. `--output-schema` controls the JSON contract that writer serializes.
The `default` schema is the default output contract. It stays close to `minimal`, but adds optional `categories` on each segment:
The named schemas are stable public contracts. If a consumer depends on a specific shape, it should request that schema explicitly at runtime. The runtime default selection may change in a future release.
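
The selection order documented in this change (the `--output-schema` flag first, then `SERIATIM_OUTPUT_SCHEMA`, then the runtime default `seriatim-intermediate`) could be sketched roughly as follows. The function `resolveSchema`, the `runtimeDefault` constant, and the convention that an empty string means "not provided" are assumptions for illustration only.

```go
package main

import "fmt"

// runtimeDefault mirrors the documented default; the constant name is hypothetical.
const runtimeDefault = "seriatim-intermediate"

// allowedSchemas lists the three schema names documented in the README.
var allowedSchemas = map[string]bool{
	"seriatim-minimal":      true,
	"seriatim-intermediate": true,
	"seriatim-full":         true,
}

// resolveSchema applies flag > environment > default precedence.
// An empty string stands for "not provided" (an assumption of this sketch).
func resolveSchema(flagValue, envValue string) (string, error) {
	chosen := runtimeDefault
	switch {
	case flagValue != "":
		chosen = flagValue
	case envValue != "":
		chosen = envValue
	}
	if !allowedSchemas[chosen] {
		return "", fmt.Errorf("unsupported output schema %q", chosen)
	}
	return chosen, nil
}

func main() {
	fmt.Println(resolveSchema("", ""))
	fmt.Println(resolveSchema("", "seriatim-minimal"))
	fmt.Println(resolveSchema("seriatim-full", "seriatim-minimal"))
	fmt.Println(resolveSchema("default", ""))
}
```

Under this sketch the pre-rename names such as `default` are rejected, which is why consumers that pinned an old name would need to update it.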
The `seriatim-intermediate` schema is the current default selection when neither `--output-schema` nor `SERIATIM_OUTPUT_SCHEMA` is set. It stays close to the minimal schema, but adds optional `categories` on each segment:
```json
{
"metadata": {
"application": "seriatim",
"version": "dev",
"output_schema": "default"
"output_schema": "seriatim-intermediate"
},
"segments": [
{
@@ -181,7 +184,7 @@ The `default` schema is the default output contract. It stays close to `minimal`
}
```
The explicit `seriatim` schema uses the full seriatim envelope:
The `seriatim-full` schema uses the full seriatim envelope:
```json
{
@@ -191,7 +194,7 @@ The explicit `seriatim` schema uses the full seriatim envelope:
"input_reader": "json-files",
"input_files": ["eric.json", "mike.json"],
"preprocessing_modules": ["validate-raw", "normalize-speakers", "trim-text"],
"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "coalesce", "resolve-danglers", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
"postprocessing_modules": ["detect-overlaps", "resolve-overlaps", "backchannel", "filler", "resolve-danglers", "coalesce", "detect-overlaps", "autocorrect", "assign-ids", "validate-output"],
"output_modules": ["json"]
},
"segments": [
@@ -231,14 +234,14 @@ The explicit `seriatim` schema uses the full seriatim envelope:
}
```
The `minimal` schema emits minimal metadata and compact ordered segments:
The `seriatim-minimal` schema emits minimal metadata and compact ordered segments:
```json
{
"metadata": {
"application": "seriatim",
"version": "dev",
"output_schema": "minimal"
"output_schema": "seriatim-minimal"
},
"segments": [
{
@@ -254,7 +257,7 @@ The `minimal` schema emits minimal metadata and compact ordered segments:
Minimal output intentionally omits categories, overlap groups, source/provenance fields, and pipeline configuration metadata.
Default output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.
Intermediate output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.
Segments are sorted deterministically by:
@@ -270,7 +273,7 @@ The public Go output contract is available from:
import "gitea.maximumdirect.net/eric/seriatim/schema"
```
The same package embeds machine-readable JSON Schemas in `schema/output.schema.json`, `schema/default-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
The same package embeds machine-readable JSON Schemas in `schema/full-output.schema.json`, `schema/intermediate-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
## Overlap Detection
@@ -295,12 +298,12 @@ For each detected overlap group, `resolve-overlaps` uses preserved WhisperX word
- The resolution window expands the detected overlap group by `--coalesce-gap` seconds on both sides.
- Nearby same-speaker context segments are included when they intersect the expanded window and their start or end is within `--coalesce-gap` of the original overlap boundary.
- Words are included when their interval intersects the expanded resolution window.
- Once a segment is selected for replacement, all timed words from that segment participate in word-run construction; the window controls segment selection, not per-word clipping.
- Context segments that are part of another detected overlap group are not pulled into the current group.
- Untimed words are included in replacement text in original word order when nearby timed words create a replacement run.
- Untimed words do not affect replacement segment start/end times or word-run gap splitting.
- Words for the same speaker are merged into one run when the gap between adjacent words is no greater than `SERIATIM_OVERLAP_WORD_RUN_GAP`.
- The default word-run gap is `0.75` seconds.
- The default word-run gap is `1.0` seconds.
- Set `SERIATIM_OVERLAP_WORD_RUN_GAP` to a positive number of seconds to override the default.
- Near-start replacement word runs are reordered so shorter segments come first when adjacent starts are within `SERIATIM_OVERLAP_WORD_RUN_REORDER_WINDOW`.
- The default word-run reorder window is `1.0` seconds.
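
To make the word-run gap rule concrete, here is a small Go sketch that splits one speaker's timed words into runs. It assumes the gap is measured from the previous word's end to the next word's start and that the words are already sorted by start time. The `Word` type and `splitRuns` function are hypothetical names for this example, not seriatim's API.

```go
package main

import "fmt"

// Word is a hypothetical timed word for a single speaker.
type Word struct {
	Text       string
	Start, End float64
}

// splitRuns groups words into runs, starting a new run whenever the gap
// between the previous word's end and the next word's start exceeds maxGap.
// Measuring the gap end-to-start is an assumption of this sketch.
func splitRuns(words []Word, maxGap float64) [][]Word {
	var runs [][]Word
	for i, w := range words {
		if i > 0 && w.Start-words[i-1].End <= maxGap {
			last := len(runs) - 1
			runs[last] = append(runs[last], w)
			continue
		}
		runs = append(runs, []Word{w})
	}
	return runs
}

func main() {
	words := []Word{
		{Text: "so", Start: 0.0, End: 0.5},
		{Text: "anyway", Start: 1.4, End: 1.8}, // roughly 0.9 s after "so"
		{Text: "right", Start: 4.0, End: 4.3},  // more than 2 s after "anyway"
	}
	fmt.Println(len(splitRuns(words, 0.75))) // 3 runs with the old default
	fmt.Println(len(splitRuns(words, 1.0)))  // 2 runs with the new default
}
```

The example shows the practical effect of raising the default from `0.75` to `1.0`: a pause of about 0.9 seconds no longer splits a run.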
@@ -339,12 +342,12 @@ The default pipeline runs `resolve-danglers` before `coalesce` and before the se
- Dangling-end fragments have no more than two words and end in punctuation.
- Dangling-start fragments have no more than two words.
- Matching uses any shared `derived_from` value.
- Matching uses same-speaker segments with any shared `derived_from` value.
- Merged segments use `source_ref` values such as `resolve-danglers:1`, keep the target segment's transcript position, and union `derived_from`.
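
The fragment and matching rules above can be sketched as simple predicates. This is only an illustration: the `Segment` type, the function names, the use of `unicode.IsPunct` to decide what counts as punctuation, and the placeholder `derived_from` values are all assumptions, and empty text is treated as not being a fragment.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Segment is a hypothetical, reduced view of a transcript segment.
type Segment struct {
	Speaker     string
	Text        string
	DerivedFrom []string
}

// isDanglingStart reports whether text has one or two words.
func isDanglingStart(text string) bool {
	n := len(strings.Fields(text))
	return n > 0 && n <= 2
}

// isDanglingEnd reports whether text has one or two words and ends in
// punctuation, as classified by unicode.IsPunct (an assumption).
func isDanglingEnd(text string) bool {
	trimmed := strings.TrimSpace(text)
	last, _ := utf8.DecodeLastRuneInString(trimmed)
	return isDanglingStart(trimmed) && unicode.IsPunct(last)
}

// canMatch requires the same speaker and at least one shared derived_from value.
func canMatch(a, b Segment) bool {
	if a.Speaker != b.Speaker {
		return false
	}
	seen := make(map[string]bool, len(a.DerivedFrom))
	for _, id := range a.DerivedFrom {
		seen[id] = true
	}
	for _, id := range b.DerivedFrom {
		if seen[id] {
			return true
		}
	}
	return false
}

func main() {
	frag := Segment{Speaker: "eric", Text: "right.", DerivedFrom: []string{"src-1"}}
	target := Segment{Speaker: "eric", Text: "so that is", DerivedFrom: []string{"src-1", "src-2"}}
	other := Segment{Speaker: "mike", Text: "okay", DerivedFrom: []string{"src-1"}}

	fmt.Println(isDanglingEnd(frag.Text), canMatch(frag, target), canMatch(frag, other))
}
```

The same-speaker check in `canMatch` reflects the wording added in this change; a shared `derived_from` value alone is not enough.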
## Coalescing
The default pipeline runs `coalesce` after `resolve-danglers` and the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.
The default pipeline runs `coalesce` after `resolve-danglers` and before the second overlap detection pass. It merges adjacent same-speaker segments in the transcript's current order when `next.start - current.end <= --coalesce-gap`.
Coalesced segments use `source_ref` values such as `coalesce:1`, include `derived_from`, and omit `source_segment_index`.
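
The merge condition `next.start - current.end <= --coalesce-gap` can be illustrated with a short Go sketch. The `Segment` type and `coalesce` function are hypothetical. Joining text with a single space and comparing each next segment against the already merged segment's end are assumptions, and the sketch leaves out `source_ref` and `derived_from` bookkeeping.

```go
package main

import "fmt"

// Segment is a hypothetical, reduced view of a transcript segment.
type Segment struct {
	Speaker    string
	Start, End float64
	Text       string
}

// coalesce merges adjacent same-speaker segments, in their current order,
// when next.Start - current.End <= gap.
func coalesce(segs []Segment, gap float64) []Segment {
	var out []Segment
	for _, next := range segs {
		if n := len(out); n > 0 {
			cur := &out[n-1]
			if cur.Speaker == next.Speaker && next.Start-cur.End <= gap {
				cur.Text += " " + next.Text // joining with a space is an assumption
				if next.End > cur.End {
					cur.End = next.End
				}
				continue
			}
		}
		out = append(out, next)
	}
	return out
}

func main() {
	segs := []Segment{
		{Speaker: "eric", Start: 0, End: 2, Text: "I think"},
		{Speaker: "eric", Start: 4, End: 6, Text: "we should ship."},
		{Speaker: "mike", Start: 6.5, End: 7, Text: "Agreed."},
		{Speaker: "eric", Start: 12, End: 13, Text: "Good."},
	}
	for _, s := range coalesce(segs, 3.0) {
		fmt.Printf("%s %.1f-%.1f %q\n", s.Speaker, s.Start, s.End, s.Text)
	}
}
```

With the default gap of `3.0`, the first two segments merge because they are 2 seconds apart. The last segment stays separate because a different speaker sits between it and the merged one, so the two are not adjacent.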