Added a new JSON public schema as the default output artifact

2026-04-28 21:32:43 -05:00
parent 80ac7e97dd
commit cc80a123ef
14 changed files with 533 additions and 12 deletions


@@ -49,7 +49,7 @@ Global flags:
| `--autocorrect` | No | none | Autocorrect rules YAML file. When omitted, the default `autocorrect` module leaves text unchanged. |
| `--input-reader` | No | `json-files` | Input reader module. |
| `--output-modules` | No | `json` | Comma-separated output modules. |
| `--output-schema` | No | `seriatim` | JSON output contract. Allowed values are `seriatim` and `minimal`. |
| `--output-schema` | No | `default` | JSON output contract. Allowed values are `default`, `minimal`, and `seriatim`. |
| `--preprocessing-modules` | No | `validate-raw,normalize-speakers,trim-text` | Comma-separated preprocessing modules, evaluated in order. |
| `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
| `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |
@@ -159,7 +159,29 @@ The old `inputs:` direct mapping format is no longer supported.
`--output-modules json` controls the writer. `--output-schema` controls the JSON contract that writer serializes.
The default `seriatim` schema uses the full seriatim envelope:
The `default` schema is the default output contract. It stays close to `minimal`, but adds optional `categories` on each segment:
```json
{
"metadata": {
"application": "seriatim",
"version": "dev",
"output_schema": "default"
},
"segments": [
{
"id": 1,
"start": 1.25,
"end": 3.5,
"speaker": "Eric Rakestraw",
"text": "Hello there.",
"categories": ["backchannel"]
}
]
}
```
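As a rough illustration of consuming this shape, the sketch below decodes the example document with Go's standard `encoding/json`. The `DefaultOutput`, `Metadata`, and `Segment` types and the `decodeDefault` helper are hypothetical stand-ins inferred from the example above; they are not the types exported by the seriatim `schema` package, which may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata, Segment, and DefaultOutput are hypothetical types inferred from
// the example JSON; they are not the real seriatim schema types.
type Metadata struct {
	Application  string `json:"application"`
	Version      string `json:"version"`
	OutputSchema string `json:"output_schema"`
}

type Segment struct {
	ID      int     `json:"id"`
	Start   float64 `json:"start"`
	End     float64 `json:"end"`
	Speaker string  `json:"speaker"`
	Text    string  `json:"text"`
	// Categories is optional in the default schema, so it may be absent.
	Categories []string `json:"categories,omitempty"`
}

type DefaultOutput struct {
	Metadata Metadata  `json:"metadata"`
	Segments []Segment `json:"segments"`
}

// decodeDefault parses a document and rejects any output_schema other than
// "default". This check is an assumption made for the example.
func decodeDefault(data []byte) (DefaultOutput, error) {
	var out DefaultOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return DefaultOutput{}, err
	}
	if out.Metadata.OutputSchema != "default" {
		return DefaultOutput{}, fmt.Errorf("unexpected output_schema %q", out.Metadata.OutputSchema)
	}
	return out, nil
}

// sample reproduces the example document shown above.
const sample = `{
  "metadata": {"application": "seriatim", "version": "dev", "output_schema": "default"},
  "segments": [
    {"id": 1, "start": 1.25, "end": 3.5, "speaker": "Eric Rakestraw",
     "text": "Hello there.", "categories": ["backchannel"]}
  ]
}`

func main() {
	out, err := decodeDefault([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, s := range out.Segments {
		fmt.Printf("%d %s %.2f-%.2f %q %v\n", s.ID, s.Speaker, s.Start, s.End, s.Text, s.Categories)
	}
}
```

Because `categories` is optional, a consumer written this way should treat a nil `Categories` slice as "no categories" rather than as an error.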
The explicit `seriatim` schema uses the full seriatim envelope:
```json
{
@@ -230,7 +252,9 @@ The `minimal` schema emits minimal metadata and compact ordered segments:
}
```
Minimal output intentionally omits overlap groups, categories, source/provenance fields, and pipeline configuration metadata.
Minimal output intentionally omits categories, overlap groups, source/provenance fields, and pipeline configuration metadata.
Default output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.
Segments are sorted deterministically by:
@@ -246,7 +270,7 @@ The public Go output contract is available from:
import "gitea.maximumdirect.net/eric/seriatim/schema"
```
The same package embeds machine-readable JSON Schemas in `schema/output.schema.json` and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
The same package embeds machine-readable JSON Schemas in `schema/output.schema.json`, `schema/default-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
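The segment ID rule described above (present, sequential, starting at `1`) can be sketched as a small standalone check. This is only an illustration of the rule, not the `validate-output` implementation; `checkSequentialIDs` is a hypothetical helper, and it assumes a missing ID decodes to Go's zero value `0`.

```go
package main

import "fmt"

// checkSequentialIDs is a hypothetical helper: it reports the first position
// where the IDs deviate from 1, 2, 3, ... A missing ID (assumed here to be 0)
// fails the same comparison, so no separate presence check is needed.
func checkSequentialIDs(ids []int) error {
	for i, id := range ids {
		want := i + 1
		if id != want {
			return fmt.Errorf("segment at index %d has id %d, want %d", i, id, want)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkSequentialIDs([]int{1, 2, 3})) // valid sequence
	fmt.Println(checkSequentialIDs([]int{1, 3}))    // gap after the first ID
	fmt.Println(checkSequentialIDs([]int{0, 1}))    // does not start at 1
}
```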
## Overlap Detection