Added support for a minimal JSON output schema

This commit is contained in:
2026-04-28 14:39:00 -05:00
parent a3ca6665a9
commit 9cca88280f
16 changed files with 658 additions and 44 deletions

View File

@@ -49,6 +49,7 @@ Global flags:
| `--autocorrect` | No | none | Autocorrect rules YAML file. When omitted, the default `autocorrect` module leaves text unchanged. |
| `--input-reader` | No | `json-files` | Input reader module. |
| `--output-modules` | No | `json` | Comma-separated output modules. |
| `--output-schema` | No | `seriatim` | JSON output contract. Allowed values are `seriatim` and `minimal`. |
| `--preprocessing-modules` | No | `validate-raw,normalize-speakers,trim-text` | Comma-separated preprocessing modules, evaluated in order. |
| `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
| `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |
@@ -156,7 +157,9 @@ The old `inputs:` direct mapping format is no longer supported.
## Output JSON Format
The merged output uses the current seriatim envelope:
`--output-modules json` controls the writer. `--output-schema` controls the JSON contract that writer serializes.
The default `seriatim` schema uses the full seriatim envelope:
```json
{
@@ -206,6 +209,29 @@ The merged output uses the current seriatim envelope:
}
```
The `minimal` schema emits minimal metadata and compact ordered segments:
```json
{
"metadata": {
"application": "seriatim",
"version": "dev",
"output_schema": "minimal"
},
"segments": [
{
"id": 1,
"start": 1.25,
"end": 3.5,
"speaker": "Eric Rakestraw",
"text": "Hello there."
}
]
}
```
Minimal output intentionally omits overlap groups, categories, source/provenance fields, and pipeline configuration metadata.
Segments are sorted deterministically by:
```text
@@ -220,7 +246,7 @@ The public Go output contract is available from:
import "gitea.maximumdirect.net/eric/seriatim/schema"
```
The same package embeds the machine-readable JSON Schema in `schema/output.schema.json`. The default `validate-output` postprocessor validates the output shape and verifies final segment IDs are present, sequential, and start at `1`.
The same package embeds machine-readable JSON Schemas in `schema/output.schema.json` and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
## Overlap Detection