Added a new JSON public schema as the default output artifact
README.md (32 lines changed)
@@ -49,7 +49,7 @@ Global flags:
 | `--autocorrect` | No | none | Autocorrect rules YAML file. When omitted, the default `autocorrect` module leaves text unchanged. |
 | `--input-reader` | No | `json-files` | Input reader module. |
 | `--output-modules` | No | `json` | Comma-separated output modules. |
-| `--output-schema` | No | `seriatim` | JSON output contract. Allowed values are `seriatim` and `minimal`. |
+| `--output-schema` | No | `default` | JSON output contract. Allowed values are `default`, `minimal`, and `seriatim`. |
 | `--preprocessing-modules` | No | `validate-raw,normalize-speakers,trim-text` | Comma-separated preprocessing modules, evaluated in order. |
 | `--postprocessing-modules` | No | `detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output` | Comma-separated postprocessing modules, evaluated in order. |
 | `--coalesce-gap` | No | `3.0` | Maximum same-speaker gap in seconds for `coalesce`; also used as the `resolve-overlaps` context window. Must be a non-negative float. |
@@ -159,7 +159,29 @@ The old `inputs:` direct mapping format is no longer supported.
 
 `--output-modules json` controls the writer. `--output-schema` controls the JSON contract that writer serializes.
 
-The default `seriatim` schema uses the full seriatim envelope:
+The `default` schema is the default output contract. It stays close to `minimal`, but adds optional `categories` on each segment:
+
+```json
+{
+  "metadata": {
+    "application": "seriatim",
+    "version": "dev",
+    "output_schema": "default"
+  },
+  "segments": [
+    {
+      "id": 1,
+      "start": 1.25,
+      "end": 3.5,
+      "speaker": "Eric Rakestraw",
+      "text": "Hello there.",
+      "categories": ["backchannel"]
+    }
+  ]
+}
+```
+
+The explicit `seriatim` schema uses the full seriatim envelope:
 
 ```json
 {
@@ -230,7 +252,9 @@ The `minimal` schema emits minimal metadata and compact ordered segments:
 }
 ```
 
-Minimal output intentionally omits overlap groups, categories, source/provenance fields, and pipeline configuration metadata.
+Minimal output intentionally omits categories, overlap groups, source/provenance fields, and pipeline configuration metadata.
+
+Default output intentionally omits overlap groups and source/provenance fields, but keeps optional `categories` and minimal metadata.
 
 Segments are sorted deterministically by:
 
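As an illustration of the `default` contract described in this diff, the following Go sketch decodes a document shaped like the README example and checks that segment IDs are sequential and start at `1`, the property the README attributes to `validate-output`. The type and function names (`Metadata`, `Segment`, `DefaultOutput`, `ParseDefaultOutput`, `CheckSequentialIDs`) are hypothetical and are not taken from the `schema` package; only the JSON keys come from the README example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata mirrors the "metadata" object in the README example.
// Hypothetical type: the real definitions live in the seriatim schema package.
type Metadata struct {
	Application  string `json:"application"`
	Version      string `json:"version"`
	OutputSchema string `json:"output_schema"`
}

// Segment mirrors one entry of "segments". Categories is optional,
// so it is left nil when the key is absent.
type Segment struct {
	ID         int      `json:"id"`
	Start      float64  `json:"start"`
	End        float64  `json:"end"`
	Speaker    string   `json:"speaker"`
	Text       string   `json:"text"`
	Categories []string `json:"categories,omitempty"`
}

// DefaultOutput is an assumed top-level shape for the default schema.
type DefaultOutput struct {
	Metadata Metadata  `json:"metadata"`
	Segments []Segment `json:"segments"`
}

// ParseDefaultOutput decodes a default-schema JSON document.
func ParseDefaultOutput(data []byte) (DefaultOutput, error) {
	var out DefaultOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return out, fmt.Errorf("decode output: %w", err)
	}
	return out, nil
}

// CheckSequentialIDs reports an error unless IDs are 1, 2, 3, ... in order.
func CheckSequentialIDs(segs []Segment) error {
	for i, s := range segs {
		if want := i + 1; s.ID != want {
			return fmt.Errorf("segment at index %d has id %d, want %d", i, s.ID, want)
		}
	}
	return nil
}

func main() {
	sample := []byte(`{
	  "metadata": {"application": "seriatim", "version": "dev", "output_schema": "default"},
	  "segments": [
	    {"id": 1, "start": 1.25, "end": 3.5, "speaker": "Eric Rakestraw",
	     "text": "Hello there.", "categories": ["backchannel"]}
	  ]
	}`)

	out, err := ParseDefaultOutput(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if err := CheckSequentialIDs(out.Segments); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out.Metadata.OutputSchema, len(out.Segments), out.Segments[0].Categories)
}
```

A consumer that needs strict validation would more likely use the embedded JSON Schema files named in the README; this sketch only shows the shape of the data.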
@@ -246,7 +270,7 @@ The public Go output contract is available from:
 import "gitea.maximumdirect.net/eric/seriatim/schema"
 ```
 
-The same package embeds machine-readable JSON Schemas in `schema/output.schema.json` and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
+The same package embeds machine-readable JSON Schemas in `schema/output.schema.json`, `schema/default-output.schema.json`, and `schema/minimal-output.schema.json`. The default `validate-output` postprocessor validates the selected output shape and verifies final segment IDs are present, sequential, and start at `1`.
 
 ## Overlap Detection
 