Compare commits

6 commits

| SHA1 |
| --- |
| 6dbb7ab17e |
| 3591041fa8 |
| 5b008e272c |
| 6c780f6293 |
| c132f3fd5d |
| 3679435063 |
README.md (60 changed lines)

@@ -1,8 +1,8 @@
 # seriatim

-`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID and normalizes external transcript-like JSON into standard seriatim output schemas.

-The current implementation supports the `merge` and `trim` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset.
+The current implementation supports the `merge`, `trim`, and `normalize` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset. `normalize` reads transcript-like JSON input, validates required segment fields, sorts deterministically, assigns fresh IDs, and emits a selected seriatim output schema.

 ## Usage
@@ -34,11 +34,30 @@ go run ./cmd/seriatim trim \
     --keep "1-10, 15, 20-25"
 ```

+Normalize external transcript-style JSON:
+
+```sh
+go run ./cmd/seriatim normalize \
+    --input-file transcript.json \
+    --output-file normalized.json
+```
+
+Normalize an Audita-style bare segment array to full schema with report output:
+
+```sh
+go run ./cmd/seriatim normalize \
+    --input-file audita-segments.json \
+    --output-file normalized-full.json \
+    --output-schema seriatim-full \
+    --report-file normalize-report.json
+```
+
 ## CLI

 ```text
 seriatim merge [flags]
 seriatim trim [flags]
+seriatim normalize [flags]
 ```

 Global flags:
@@ -108,6 +127,43 @@ Global flags:
 - The report includes a `trim-audit` event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
 - Old-to-new ID mapping is emitted as a deterministic ordered array of `{old_id, new_id}` pairs.

+`normalize` flags:
+
+| Flag | Required | Default | Description |
+| --- | --- | --- | --- |
+| `--input-file` | Yes | none | Input transcript JSON file. |
+| `--output-file` | Yes | none | Normalized transcript JSON output path. |
+| `--output-schema` | No | `seriatim-intermediate` (resolved via `SERIATIM_OUTPUT_SCHEMA` when set) | Output JSON schema: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`. |
+| `--output-modules` | No | `json` | Comma-separated output modules. Current normalize support is `json` only. |
+| `--report-file` | No | none | Optional report JSON output path. |
+
+`normalize` input shapes:
+
+- Top-level object with a `segments` array.
+- Bare top-level array of segment objects (for example, Audita-style output).
+
+`normalize` required segment fields:
+
+- `start`
+- `end`
+- `speaker`
+- `text`
+
+`normalize` behavior:
+
+- Validates `start >= 0`, `end >= start`, and non-empty `speaker`.
+- Accepts existing input `id` values as provenance only.
+- Reassigns output segment IDs sequentially from `1` to `N`.
+- Sorts deterministically by `(start, end, original_input_index, speaker)`.
+- Uses original input order only as a tie-breaker.
+- Does not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
+- Useful for converting external transcript outputs into standard seriatim artifacts.
+
+`normalize` report output:
+
+- When `--report-file` is provided, normalize emits deterministic report events with input shape detection, segment counts, schema/module selections, sorting/ID diagnostics, and output write/validation summaries.
+- A machine-readable `normalize-audit` event is included for downstream tooling.
+
 Environment variables:

 | Environment Variable | Default | Description |
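The sorting and ID-reassignment rules documented above can be sketched as a deterministic transformation. The `Segment` struct and `normalizeOrder` function below are simplified stand-ins for illustration, not the actual `internal/normalize` types:

```go
package main

import (
	"fmt"
	"sort"
)

// Segment is a simplified stand-in for the normalize segment model.
type Segment struct {
	ID      int
	Start   float64
	End     float64
	Speaker string
	Text    string
	// inputIndex records the original position; it is used only as a tie-breaker.
	inputIndex int
}

// normalizeOrder sorts deterministically by (start, end, original input index,
// speaker) and then reassigns IDs sequentially from 1 to N. Input IDs are
// treated as provenance only and are never carried into the output.
func normalizeOrder(segments []Segment) []Segment {
	for i := range segments {
		segments[i].inputIndex = i
	}
	sort.SliceStable(segments, func(a, b int) bool {
		sa, sb := segments[a], segments[b]
		if sa.Start != sb.Start {
			return sa.Start < sb.Start
		}
		if sa.End != sb.End {
			return sa.End < sb.End
		}
		if sa.inputIndex != sb.inputIndex {
			return sa.inputIndex < sb.inputIndex
		}
		return sa.Speaker < sb.Speaker
	})
	for i := range segments {
		segments[i].ID = i + 1 // fresh sequential IDs
	}
	return segments
}

func main() {
	out := normalizeOrder([]Segment{
		{ID: 99, Start: 5, End: 6, Speaker: "Bob", Text: "second"},
		{ID: 10, Start: 1, End: 2, Speaker: "Alice", Text: "first"},
	})
	for _, s := range out {
		fmt.Println(s.ID, s.Speaker, s.Text)
	}
}
```

Because every segment gets a unique input index, the final `speaker` key is reached only in degenerate cases; it completes the documented key tuple.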
@@ -3,7 +3,8 @@
 `seriatim` is a deterministic transcript utility for:

 - merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
-- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming.
+- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming, and
+- canonicalizing external transcript-style JSON inputs into standard seriatim output schemas.

 The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.
@@ -60,7 +61,7 @@ configuration check
 Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.

-`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
+`merge` runs this pipeline. `trim` and `normalize` are intentionally separate from this pipeline and operate at the artifact layer.

 ## Stage Contracts
@@ -214,6 +215,24 @@ Design constraints:
 `trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.

+### 8. Artifact Canonicalization Stage (`normalize` command)
+
+`normalize` is an artifact-level command that reads transcript-like JSON and emits a standard seriatim output artifact in a selected schema.
+
+Design constraints:
+
+- `normalize` runs outside the merge pipeline and does not invoke merge preprocessing or postprocessing modules.
+- `normalize` accepts two input shapes: object-with-`segments` and bare segment arrays.
+- `normalize` validates required segment fields (`start`, `end`, `speaker`, `text`) and timing/speaker constraints.
+- `normalize` sorts segments deterministically by chronological keys and stable input-index tie-breakers.
+- `normalize` assigns fresh sequential output IDs (`1..N`) after sorting.
+- `normalize` validates final output against the selected schema before writing.
+- `normalize` writes optional deterministic report diagnostics when `--report-file` is requested.
+
+`normalize` is intended for canonicalizing external transcript outputs (including Audita-style bare arrays) into seriatim contracts, not for running merge-time language or overlap transformations.
+
+`normalize` must not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.

 ## Module Classification

 Modules should be classified by their contract and allowed effects.
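The two accepted input shapes can be distinguished without fully decoding the document, for example by inspecting the first non-whitespace byte. The sketch below uses the shape names that appear in the `normalize-audit` report; `detectInputShape` itself is a hypothetical helper, not the actual `internal/normalize` parser:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// detectInputShape reports which of the two accepted normalize input shapes
// a JSON document uses: a top-level object with a "segments" array, or a
// bare top-level segment array. Field-level validation happens later.
func detectInputShape(data []byte) (string, error) {
	trimmed := bytes.TrimLeft(data, " \t\r\n")
	if len(trimmed) == 0 {
		return "", errors.New("empty input")
	}
	switch trimmed[0] {
	case '[':
		return "bare_segments_array", nil
	case '{':
		return "object_with_segments", nil
	default:
		return "", errors.New("input is not a JSON object or array")
	}
}

func main() {
	shape, err := detectInputShape([]byte(`[{"start":1,"end":2}]`))
	if err != nil {
		panic(err)
	}
	fmt.Println(shape) // bare_segments_array
}
```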
@@ -442,6 +461,13 @@ Trim-specific determinism requirements:
 - Old-to-new ID mapping in trim reports is emitted in deterministic order.
 - Full-schema overlap recomputation is deterministic for the same input artifact and selector.

+Normalize-specific determinism requirements:
+
+- Input-shape detection is deterministic.
+- Segment ordering is deterministic for identical input data.
+- Output IDs are always reassigned sequentially after deterministic sorting.
+- Normalize diagnostic reports are deterministic for identical inputs and configuration.
+
 ## Go Package Layout

 ```text
@@ -450,6 +476,7 @@ internal/config/ CLI/env/config loading and validation
 internal/pipeline/  Pipeline orchestration and module registry
 internal/builtin/   Built-in pipeline modules
 internal/artifact/  Conversion from internal model to public output schema
+internal/normalize/ Normalize input parsing, validation, deterministic sorting, schema conversion, and diagnostics
 internal/trim/      Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
 internal/buildinfo/ Build-time version metadata
 internal/speaker/   Speaker map parsing and lookup

@@ -468,6 +495,12 @@ For trim:
 - CLI command code handles only flag parsing, file I/O, and report emission.
 - Transform logic is deterministic and pure except for command-layer I/O.

+For normalize:
+
+- `internal/normalize` contains parsing/validation and deterministic schema conversion logic.
+- CLI command code handles flag parsing and delegates execution.
+- Normalize remains artifact-level and does not compose merge pipeline modules.
+
 ## Default Modules

 The default pipeline is equivalent to explicit module lists.
internal/cli/normalize.go (new file, 39 lines)

@@ -0,0 +1,39 @@
```go
package cli

import (
	"github.com/spf13/cobra"

	"gitea.maximumdirect.net/eric/seriatim/internal/config"
	"gitea.maximumdirect.net/eric/seriatim/internal/normalize"
)

func newNormalizeCommand() *cobra.Command {
	var opts config.NormalizeOptions

	cmd := &cobra.Command{
		Use:   "normalize",
		Short: "Normalize a transcript artifact into a standard seriatim output shape",
		RunE: func(cmd *cobra.Command, args []string) error {
			normalizeOpts := opts
			if !cmd.Flags().Changed("output-schema") {
				normalizeOpts.OutputSchema = ""
			}

			cfg, err := config.NewNormalizeConfig(normalizeOpts)
			if err != nil {
				return err
			}

			return normalize.Run(cmd.Context(), cfg)
		},
	}

	flags := cmd.Flags()
	flags.StringVar(&opts.InputFile, "input-file", "", "input transcript JSON file")
	flags.StringVar(&opts.OutputFile, "output-file", "", "output transcript JSON file")
	flags.StringVar(&opts.ReportFile, "report-file", "", "optional report JSON file")
	flags.StringVar(&opts.OutputSchema, "output-schema", config.DefaultOutputSchema, "output JSON schema: seriatim-minimal, seriatim-intermediate, or seriatim-full")
	flags.StringVar(&opts.OutputModules, "output-modules", config.DefaultOutputModules, "comma-separated output modules")

	return cmd
}
```
internal/cli/normalize_test.go (new file, 457 lines)

@@ -0,0 +1,457 @@
```go
package cli

import (
	"encoding/json"
	"os"
	"path/filepath"
	"strings"
	"testing"

	"gitea.maximumdirect.net/eric/seriatim/internal/config"
	"gitea.maximumdirect.net/eric/seriatim/internal/report"
	"gitea.maximumdirect.net/eric/seriatim/schema"
)

func TestNormalizeCommandIsRecognized(t *testing.T) {
	cmd := NewRootCommand()
	cmd.SetArgs([]string{"normalize", "--help"})
	if err := cmd.Execute(); err != nil {
		t.Fatalf("normalize command should be recognized: %v", err)
	}
}

func TestNormalizeMissingInputFileFails(t *testing.T) {
	dir := t.TempDir()
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--output-file", output,
	)
	if err == nil {
		t.Fatal("expected missing input-file error")
	}
	if !strings.Contains(err.Error(), "--input-file is required") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNormalizeMissingOutputFileFails(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)

	err := executeNormalize(
		"--input-file", input,
	)
	if err == nil {
		t.Fatal("expected missing output-file error")
	}
	if !strings.Contains(err.Error(), "--output-file is required") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNormalizeInvalidOutputSchemaFails(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", "compact",
	)
	if err == nil {
		t.Fatal("expected invalid output schema error")
	}
	if !strings.Contains(err.Error(), "--output-schema must be one of") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNormalizeInvalidOutputModuleFails(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-modules", "yaml",
	)
	if err == nil {
		t.Fatal("expected invalid output module error")
	}
	if !strings.Contains(err.Error(), "unknown output module") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNormalizeDefaultOutputSchemaIsIntermediate(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{
		"segments": [
			{"id": 99, "start": 5, "end": 6, "speaker": "Bob", "text": "second", "categories": ["filler"]},
			{"id": 10, "start": 1, "end": 2, "speaker": "Alice", "text": "first", "categories": ["backchannel"]}
		]
	}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.IntermediateTranscript
	readJSON(t, output, &transcript)
	if transcript.Metadata.OutputSchema != config.OutputSchemaIntermediate {
		t.Fatalf("output schema = %q, want %q", transcript.Metadata.OutputSchema, config.OutputSchemaIntermediate)
	}
	if len(transcript.Segments) != 2 {
		t.Fatalf("segment count = %d, want 2", len(transcript.Segments))
	}
	if transcript.Segments[0].ID != 1 || transcript.Segments[1].ID != 2 {
		t.Fatalf("segment IDs = %d,%d, want 1,2", transcript.Segments[0].ID, transcript.Segments[1].ID)
	}
	if transcript.Segments[0].Text != "first" || transcript.Segments[1].Text != "second" {
		t.Fatalf("unexpected sort order: %#v", transcript.Segments)
	}
	if len(transcript.Segments[0].Categories) != 1 || transcript.Segments[0].Categories[0] != "backchannel" {
		t.Fatalf("expected categories preserved on first segment, got %#v", transcript.Segments[0].Categories)
	}
}

func TestNormalizeBareArrayInputToIntermediateOutput(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `[
		{"start": 2, "end": 3, "speaker": "Bob", "text": "second"},
		{"start": 1, "end": 2, "speaker": "Alice", "text": "first"}
	]`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", config.OutputSchemaIntermediate,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.IntermediateTranscript
	readJSON(t, output, &transcript)
	if len(transcript.Segments) != 2 {
		t.Fatalf("segment count = %d, want 2", len(transcript.Segments))
	}
	if transcript.Segments[0].Speaker != "Alice" || transcript.Segments[1].Speaker != "Bob" {
		t.Fatalf("unexpected sorted speakers: %#v", transcript.Segments)
	}
}

func TestNormalizeInputIndexTieBreakerIsDeterministic(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `[
		{"start": 1, "end": 2, "speaker": "Zulu", "text": "first in"},
		{"start": 1, "end": 2, "speaker": "Alpha", "text": "second in"}
	]`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.IntermediateTranscript
	readJSON(t, output, &transcript)
	if transcript.Segments[0].Speaker != "Zulu" || transcript.Segments[1].Speaker != "Alpha" {
		t.Fatalf("tie-break order mismatch: %#v", transcript.Segments)
	}
}

func TestNormalizeMinimalSchemaOmitsCategories(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{
		"segments": [
			{"start": 1, "end": 2, "speaker": "Alice", "text": "first", "categories": ["filler"]}
		]
	}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", config.OutputSchemaMinimal,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.MinimalTranscript
	readJSON(t, output, &transcript)
	if transcript.Metadata.OutputSchema != config.OutputSchemaMinimal {
		t.Fatalf("output schema = %q, want %q", transcript.Metadata.OutputSchema, config.OutputSchemaMinimal)
	}
	if len(transcript.Segments) != 1 || transcript.Segments[0].ID != 1 {
		t.Fatalf("unexpected minimal output: %#v", transcript.Segments)
	}
	bytes, readErr := os.ReadFile(output)
	if readErr != nil {
		t.Fatalf("read output: %v", readErr)
	}
	if strings.Contains(string(bytes), "categories") {
		t.Fatalf("minimal output unexpectedly contains categories:\n%s", string(bytes))
	}
}

func TestNormalizeFullSchemaOutputValidatesAndHasProvenanceFallback(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `[
		{"start": 1, "end": 2, "speaker": "Alice", "text": "first"},
		{"start": 3, "end": 4, "speaker": "Bob", "text": "second", "source":"custom.json", "source_segment_index": 7}
	]`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", config.OutputSchemaFull,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.Transcript
	readJSON(t, output, &transcript)
	if err := schema.ValidateTranscript(transcript); err != nil {
		t.Fatalf("full output should validate: %v", err)
	}
	if len(transcript.Segments) != 2 {
		t.Fatalf("segment count = %d, want 2", len(transcript.Segments))
	}
	if transcript.Segments[0].Source != filepath.Base(input) {
		t.Fatalf("source fallback = %q, want %q", transcript.Segments[0].Source, filepath.Base(input))
	}
	if transcript.Segments[0].SourceSegmentIndex == nil || *transcript.Segments[0].SourceSegmentIndex != 0 {
		t.Fatalf("source_segment_index fallback = %v, want 0", transcript.Segments[0].SourceSegmentIndex)
	}
	if transcript.Segments[1].Source != "custom.json" {
		t.Fatalf("explicit source preserved = %q, want custom.json", transcript.Segments[1].Source)
	}
	if transcript.Segments[1].SourceSegmentIndex == nil || *transcript.Segments[1].SourceSegmentIndex != 7 {
		t.Fatalf("explicit source_segment_index preserved = %v, want 7", transcript.Segments[1].SourceSegmentIndex)
	}
	if transcript.OverlapGroups == nil || len(transcript.OverlapGroups) != 0 {
		t.Fatalf("overlap_groups = %#v, want empty array", transcript.OverlapGroups)
	}
}

func TestNormalizeEmptySegmentsArrayProducesValidOutput(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.IntermediateTranscript
	readJSON(t, output, &transcript)
	if len(transcript.Segments) != 0 {
		t.Fatalf("segment count = %d, want 0", len(transcript.Segments))
	}
	if err := schema.ValidateIntermediateTranscript(transcript); err != nil {
		t.Fatalf("intermediate output should validate: %v", err)
	}
}

func TestNormalizeSelectedOutputSchemaIsHonored(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[{"start":1,"end":2,"speaker":"A","text":"one"}]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", config.OutputSchemaMinimal,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.MinimalTranscript
	readJSON(t, output, &transcript)
	if transcript.Metadata.OutputSchema != config.OutputSchemaMinimal {
		t.Fatalf("output schema = %q, want %q", transcript.Metadata.OutputSchema, config.OutputSchemaMinimal)
	}
}

func TestNormalizeReportFileWrittenAndContainsObjectInputShape(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[{"start":1,"end":2,"speaker":"A","text":"one"}]}`)
	output := filepath.Join(dir, "normalized.json")
	reportPath := filepath.Join(dir, "report.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", reportPath,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var rpt report.Report
	readJSON(t, reportPath, &rpt)
	audit := extractNormalizeAudit(t, rpt)
	if audit.InputShape != "object_with_segments" {
		t.Fatalf("input shape = %q, want object_with_segments", audit.InputShape)
	}
	if audit.InputSegmentCount != 1 {
		t.Fatalf("input segment count = %d, want 1", audit.InputSegmentCount)
	}
	if audit.OutputSchema != config.OutputSchemaIntermediate {
		t.Fatalf("output schema = %q, want %q", audit.OutputSchema, config.OutputSchemaIntermediate)
	}
	if len(audit.OutputModules) != 1 || audit.OutputModules[0] != "json" {
		t.Fatalf("output modules = %v, want [json]", audit.OutputModules)
	}
}

func TestNormalizeReportIncludesBareArrayShape(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `[{"start":1,"end":2,"speaker":"A","text":"one"}]`)
	output := filepath.Join(dir, "normalized.json")
	reportPath := filepath.Join(dir, "report.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", reportPath,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var rpt report.Report
	readJSON(t, reportPath, &rpt)
	audit := extractNormalizeAudit(t, rpt)
	if audit.InputShape != "bare_segments_array" {
		t.Fatalf("input shape = %q, want bare_segments_array", audit.InputShape)
	}
}

func TestNormalizeReportDoesNotIncludeTranscriptText(t *testing.T) {
	dir := t.TempDir()
	const segmentText = "normalize-report-secret-text"
	input := writeJSONFile(t, dir, "input.json", `[{"start":1,"end":2,"speaker":"A","text":"`+segmentText+`"}]`)
	output := filepath.Join(dir, "normalized.json")
	reportPath := filepath.Join(dir, "report.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", reportPath,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var rpt report.Report
	readJSON(t, reportPath, &rpt)
	for _, event := range rpt.Events {
		if strings.Contains(event.Message, segmentText) {
			t.Fatalf("report unexpectedly contained transcript text in event %#v", event)
		}
	}
}

func TestNormalizeReportEmptyInputEmitsWarning(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)
	output := filepath.Join(dir, "normalized.json")
	reportPath := filepath.Join(dir, "report.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", reportPath,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var rpt report.Report
	readJSON(t, reportPath, &rpt)
	found := false
	for _, event := range rpt.Events {
		if event.Stage == "normalize" && event.Module == "normalize" && event.Severity == report.SeverityWarning &&
			strings.Contains(event.Message, "zero segments") {
			found = true
			break
		}
	}
	if !found {
		t.Fatalf("expected empty transcript warning event, got %#v", rpt.Events)
	}
}

func TestNormalizeReportWriteFailureReturnsClearError(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[{"start":1,"end":2,"speaker":"A","text":"one"}]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", dir,
	)
	if err == nil {
		t.Fatal("expected report write failure")
	}
	if !strings.Contains(err.Error(), "write --report-file") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func executeNormalize(args ...string) error {
	cmd := NewRootCommand()
	cmd.SetArgs(append([]string{"normalize"}, args...))
	return cmd.Execute()
}

type normalizeAudit struct {
	Command                string   `json:"command"`
	InputFile              string   `json:"input_file"`
	OutputFile             string   `json:"output_file"`
	InputShape             string   `json:"input_shape"`
	InputSegmentCount      int      `json:"input_segment_count"`
	OutputSchema           string   `json:"output_schema"`
	OutputModules          []string `json:"output_modules"`
	IDsReassigned          bool     `json:"ids_reassigned"`
	SortingChangedInput    bool     `json:"sorting_changed_input_order"`
	SegmentsWithCategories int      `json:"segments_with_categories"`
}

func extractNormalizeAudit(t *testing.T, rpt report.Report) normalizeAudit {
	t.Helper()
	for _, event := range rpt.Events {
		if event.Stage == "normalize" && event.Module == "normalize-audit" {
			var audit normalizeAudit
			if err := json.Unmarshal([]byte(event.Message), &audit); err != nil {
				t.Fatalf("decode normalize audit: %v", err)
			}
			return audit
		}
	}
	t.Fatalf("missing normalize-audit event: %#v", rpt.Events)
	return normalizeAudit{}
}
```
@@ -10,13 +10,14 @@ import (
 func NewRootCommand() *cobra.Command {
 	cmd := &cobra.Command{
 		Use:           "seriatim",
-		Short:         "Merge per-speaker transcripts into a chronological transcript",
+		Short:         "Merge, trim, and normalize transcript artifacts",
 		Version:       buildinfo.Version,
 		SilenceErrors: true,
 		SilenceUsage:  true,
 	}

 	cmd.AddCommand(newMergeCommand())
+	cmd.AddCommand(newNormalizeCommand())
 	cmd.AddCommand(newTrimCommand())
 	return cmd
 }
@@ -58,6 +58,15 @@ type TrimOptions struct {
 	AllowEmpty bool
 }

+// NormalizeOptions captures raw CLI option values before validation.
+type NormalizeOptions struct {
+	InputFile     string
+	OutputFile    string
+	ReportFile    string
+	OutputSchema  string
+	OutputModules string
+}
+
 // Config is the validated runtime configuration for a merge invocation.
 type Config struct {
 	InputFiles []string
@@ -88,6 +97,15 @@ type TrimConfig struct {
 	AllowEmpty bool
 }

+// NormalizeConfig is the validated runtime configuration for a normalize invocation.
+type NormalizeConfig struct {
+	InputFile     string
+	OutputFile    string
+	ReportFile    string
+	OutputSchema  string
+	OutputModules []string
+}
+
 // NewMergeConfig validates raw merge options and returns normalized config.
 func NewMergeConfig(opts MergeOptions) (Config, error) {
 	cfg := Config{
@@ -247,6 +265,54 @@ func NewTrimConfig(opts TrimOptions) (TrimConfig, error) {
 	}, nil
 }

+// NewNormalizeConfig validates raw normalize options and returns normalized config.
+func NewNormalizeConfig(opts NormalizeOptions) (NormalizeConfig, error) {
+	inputFile := filepath.Clean(strings.TrimSpace(opts.InputFile))
+	if strings.TrimSpace(opts.InputFile) == "" {
+		return NormalizeConfig{}, errors.New("--input-file is required")
+	}
+	if err := requireFile(inputFile, "--input-file"); err != nil {
+		return NormalizeConfig{}, err
+	}
+
+	outputFile, err := normalizeOutputPath(opts.OutputFile, "--output-file")
+	if err != nil {
+		return NormalizeConfig{}, err
+	}
+
+	reportFile := ""
+	if strings.TrimSpace(opts.ReportFile) != "" {
+		reportFile, err = normalizeOutputPath(opts.ReportFile, "--report-file")
+		if err != nil {
+			return NormalizeConfig{}, err
+		}
+	}
+
+	outputSchema, err := resolveOutputSchema(opts.OutputSchema)
+	if err != nil {
+		return NormalizeConfig{}, err
+	}
+
+	outputModules, err := parseModuleList(opts.OutputModules)
+	if err != nil {
+		return NormalizeConfig{}, fmt.Errorf("--output-modules: %w", err)
+	}
+	if len(outputModules) == 0 {
+		return NormalizeConfig{}, errors.New("--output-modules must include at least one module")
+	}
+	if err := validateNormalizeOutputModules(outputModules); err != nil {
+		return NormalizeConfig{}, err
+	}
+
+	return NormalizeConfig{
+		InputFile:     inputFile,
+		OutputFile:    outputFile,
+		ReportFile:    reportFile,
+		OutputSchema:  outputSchema,
+		OutputModules: outputModules,
+	}, nil
+}
+
 func parseModuleList(value string) ([]string, error) {
 	value = strings.TrimSpace(value)
 	if value == "" {
@@ -400,3 +466,12 @@ func contains(values []string, target string) bool {
 	}
 	return false
 }
+
+func validateNormalizeOutputModules(modules []string) error {
+	for _, module := range modules {
+		if module != "json" {
+			return fmt.Errorf("unknown output module %q", module)
+		}
+	}
+	return nil
+}
@@ -711,6 +711,107 @@ func TestNewTrimConfigRejectsInvalidOutputSchemaOverride(t *testing.T) {
 	}
 }

+func TestNewNormalizeConfigRequiresInputFile(t *testing.T) {
+	dir := t.TempDir()
+	output := filepath.Join(dir, "normalized.json")
+
+	_, err := NewNormalizeConfig(NormalizeOptions{
+		OutputFile:    output,
+		OutputModules: DefaultOutputModules,
+	})
+	if err == nil {
+		t.Fatal("expected input-file required error")
+	}
+	if !strings.Contains(err.Error(), "--input-file is required") {
+		t.Fatalf("unexpected error: %v", err)
+	}
+}
+
+func TestNewNormalizeConfigRequiresOutputFile(t *testing.T) {
+	dir := t.TempDir()
+	input := writeTempFile(t, dir, "input.json")
+
+	_, err := NewNormalizeConfig(NormalizeOptions{
+		InputFile:     input,
+		OutputModules: DefaultOutputModules,
+	})
+	if err == nil {
+		t.Fatal("expected output-file required error")
+	}
+	if !strings.Contains(err.Error(), "--output-file is required") {
+		t.Fatalf("unexpected error: %v", err)
+	}
+}
+
+func TestNewNormalizeConfigResolvesOutputSchemaDefaultAndEnv(t *testing.T) {
+	dir := t.TempDir()
+	input := writeTempFile(t, dir, "input.json")
+	output := filepath.Join(dir, "normalized.json")
+
+	t.Setenv(OutputSchemaEnv, "")
+	cfg, err := NewNormalizeConfig(NormalizeOptions{
+		InputFile:     input,
+		OutputFile:    output,
+		OutputModules: DefaultOutputModules,
+	})
+	if err != nil {
+		t.Fatalf("config failed: %v", err)
+	}
+	if cfg.OutputSchema != DefaultOutputSchema {
+		t.Fatalf("output schema = %q, want %q", cfg.OutputSchema, DefaultOutputSchema)
+	}
+
+	t.Setenv(OutputSchemaEnv, OutputSchemaMinimal)
+	cfg, err = NewNormalizeConfig(NormalizeOptions{
+		InputFile:     input,
+		OutputFile:    output,
+		OutputModules: DefaultOutputModules,
+	})
+	if err != nil {
+		t.Fatalf("config failed: %v", err)
+	}
+	if cfg.OutputSchema != OutputSchemaMinimal {
+		t.Fatalf("output schema = %q, want %q", cfg.OutputSchema, OutputSchemaMinimal)
+	}
+}
+
+func TestNewNormalizeConfigRejectsInvalidOutputSchema(t *testing.T) {
+	dir := t.TempDir()
+	input := writeTempFile(t, dir, "input.json")
+	output := filepath.Join(dir, "normalized.json")
+
+	_, err := NewNormalizeConfig(NormalizeOptions{
+		InputFile:     input,
+		OutputFile:    output,
+		OutputSchema:  "compact",
+		OutputModules: DefaultOutputModules,
+	})
+	if err == nil {
+		t.Fatal("expected output schema error")
+	}
+	if !strings.Contains(err.Error(), "--output-schema must be one of") {
+		t.Fatalf("unexpected error: %v", err)
+	}
+}
+
+func TestNewNormalizeConfigRejectsUnknownOutputModule(t *testing.T) {
+	dir := t.TempDir()
+	input := writeTempFile(t, dir, "input.json")
+	output := filepath.Join(dir, "normalized.json")
+
+	_, err := NewNormalizeConfig(NormalizeOptions{
+		InputFile:     input,
+		OutputFile:    output,
+		OutputModules: "json,yaml",
+	})
+	if err == nil {
+		t.Fatal("expected output module error")
+	}
+	if !strings.Contains(err.Error(), "unknown output module") {
+		t.Fatalf("unexpected error: %v", err)
+	}
+}
+
 func assertPositiveFloatEnvValidation(t *testing.T, envName string) {
 	t.Helper()
216 internal/normalize/build.go Normal file
@@ -0,0 +1,216 @@
package normalize

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"

	"gitea.maximumdirect.net/eric/seriatim/internal/artifact"
	"gitea.maximumdirect.net/eric/seriatim/internal/buildinfo"
	"gitea.maximumdirect.net/eric/seriatim/internal/config"
	"gitea.maximumdirect.net/eric/seriatim/schema"
)

// BuildResult contains normalize output plus deterministic transformation diagnostics.
type BuildResult struct {
	Output                 any
	SortingChanged         bool
	IDsReassigned          bool
	SegmentsWithCategories int
}

// Build converts parsed normalize input into a selected seriatim output schema.
func Build(parsed ParsedTranscript, cfg config.NormalizeConfig) (BuildResult, error) {
	ordered := sortedSegments(parsed.Segments)
	sortingChanged := didSortingChangeOrder(ordered)
	idsReassigned := didReassignIDs(ordered)
	segmentsWithCategories := countSegmentsWithCategories(ordered)

	switch cfg.OutputSchema {
	case config.OutputSchemaMinimal:
		output := buildMinimal(ordered)
		if err := schema.ValidateMinimalTranscript(output); err != nil {
			return BuildResult{}, fmt.Errorf("validate normalize output: %w", err)
		}
		return BuildResult{
			Output:                 output,
			SortingChanged:         sortingChanged,
			IDsReassigned:          idsReassigned,
			SegmentsWithCategories: segmentsWithCategories,
		}, nil
	case config.OutputSchemaIntermediate:
		output := buildIntermediate(ordered)
		if err := schema.ValidateIntermediateTranscript(output); err != nil {
			return BuildResult{}, fmt.Errorf("validate normalize output: %w", err)
		}
		return BuildResult{
			Output:                 output,
			SortingChanged:         sortingChanged,
			IDsReassigned:          idsReassigned,
			SegmentsWithCategories: segmentsWithCategories,
		}, nil
	case config.OutputSchemaFull:
		output := buildFull(ordered, cfg)
		if err := schema.ValidateTranscript(output); err != nil {
			return BuildResult{}, fmt.Errorf("validate normalize output: %w", err)
		}
		return BuildResult{
			Output:                 output,
			SortingChanged:         sortingChanged,
			IDsReassigned:          idsReassigned,
			SegmentsWithCategories: segmentsWithCategories,
		}, nil
	default:
		return BuildResult{}, fmt.Errorf("unsupported output schema %q", cfg.OutputSchema)
	}
}

func sortedSegments(input []InputSegment) []InputSegment {
	ordered := make([]InputSegment, len(input))
	copy(ordered, input)
	sort.SliceStable(ordered, func(i, j int) bool {
		left := ordered[i]
		right := ordered[j]
		if left.Start != right.Start {
			return left.Start < right.Start
		}
		if left.End != right.End {
			return left.End < right.End
		}
		if left.InputIndex != right.InputIndex {
			return left.InputIndex < right.InputIndex
		}
		return left.Speaker < right.Speaker
	})
	return ordered
}

func buildMinimal(segments []InputSegment) schema.MinimalTranscript {
	outputSegments := make([]schema.MinimalSegment, len(segments))
	for index, segment := range segments {
		outputSegments[index] = schema.MinimalSegment{
			ID:      index + 1,
			Start:   segment.Start,
			End:     segment.End,
			Speaker: segment.Speaker,
			Text:    segment.Text,
		}
	}

	return schema.MinimalTranscript{
		Metadata: schema.MinimalMetadata{
			Application:  artifact.ApplicationName,
			Version:      buildinfo.Version,
			OutputSchema: config.OutputSchemaMinimal,
		},
		Segments: outputSegments,
	}
}

func buildIntermediate(segments []InputSegment) schema.IntermediateTranscript {
	outputSegments := make([]schema.IntermediateSegment, len(segments))
	for index, segment := range segments {
		outputSegments[index] = schema.IntermediateSegment{
			ID:         index + 1,
			Start:      segment.Start,
			End:        segment.End,
			Speaker:    segment.Speaker,
			Text:       segment.Text,
			Categories: append([]string(nil), segment.Categories...),
		}
	}

	return schema.IntermediateTranscript{
		Metadata: schema.IntermediateMetadata{
			Application:  artifact.ApplicationName,
			Version:      buildinfo.Version,
			OutputSchema: config.OutputSchemaIntermediate,
		},
		Segments: outputSegments,
	}
}

func buildFull(segments []InputSegment, cfg config.NormalizeConfig) schema.Transcript {
	defaultSource := filepath.Base(cfg.InputFile)
	outputSegments := make([]schema.Segment, len(segments))
	for index, segment := range segments {
		source := strings.TrimSpace(segment.Source)
		if source == "" {
			source = defaultSource
		}

		sourceSegmentIndex := copyIntPtr(segment.SourceSegmentIndex)
		if sourceSegmentIndex == nil {
			fallback := segment.InputIndex
			sourceSegmentIndex = &fallback
		}

		outputSegments[index] = schema.Segment{
			ID:                 index + 1,
			Source:             source,
			SourceSegmentIndex: sourceSegmentIndex,
			SourceRef:          segment.SourceRef,
			DerivedFrom:        append([]string(nil), segment.DerivedFrom...),
			Speaker:            segment.Speaker,
			Start:              segment.Start,
			End:                segment.End,
			Text:               segment.Text,
			Categories:         append([]string(nil), segment.Categories...),
		}
	}

	return schema.Transcript{
		Metadata: schema.Metadata{
			Application:           artifact.ApplicationName,
			Version:               buildinfo.Version,
			InputReader:           "normalize-input",
			InputFiles:            []string{cfg.InputFile},
			PreprocessingModules:  []string{},
			PostprocessingModules: []string{},
			OutputModules:         append([]string(nil), cfg.OutputModules...),
		},
		Segments:      outputSegments,
		OverlapGroups: []schema.OverlapGroup{},
	}
}

func copyIntPtr(value *int) *int {
	if value == nil {
		return nil
	}
	copied := *value
	return &copied
}

func didSortingChangeOrder(segments []InputSegment) bool {
	for index, segment := range segments {
		if segment.InputIndex != index {
			return true
		}
	}
	return false
}

func didReassignIDs(segments []InputSegment) bool {
	if len(segments) == 0 {
		return false
	}
	for index, segment := range segments {
		newID := index + 1
		if segment.OriginalID == nil || *segment.OriginalID != newID {
			return true
		}
	}
	return false
}

func countSegmentsWithCategories(segments []InputSegment) int {
	count := 0
	for _, segment := range segments {
		if len(segment.Categories) > 0 {
			count++
		}
	}
	return count
}
121 internal/normalize/normalize.go Normal file
@@ -0,0 +1,121 @@
package normalize

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"strings"

	"gitea.maximumdirect.net/eric/seriatim/internal/artifact"
	"gitea.maximumdirect.net/eric/seriatim/internal/buildinfo"
	"gitea.maximumdirect.net/eric/seriatim/internal/config"
	"gitea.maximumdirect.net/eric/seriatim/internal/report"
)

type normalizeAudit struct {
	Command                string   `json:"command"`
	InputFile              string   `json:"input_file"`
	OutputFile             string   `json:"output_file"`
	InputShape             string   `json:"input_shape"`
	InputSegmentCount      int      `json:"input_segment_count"`
	OutputSchema           string   `json:"output_schema"`
	OutputModules          []string `json:"output_modules"`
	IDsReassigned          bool     `json:"ids_reassigned"`
	SortingChangedInput    bool     `json:"sorting_changed_input_order"`
	SegmentsWithCategories int      `json:"segments_with_categories"`
}

// Run executes artifact-level normalization.
func Run(ctx context.Context, cfg config.NormalizeConfig) error {
	if err := ctx.Err(); err != nil {
		return err
	}

	parsed, err := ParseFile(cfg.InputFile)
	if err != nil {
		return err
	}

	built, err := Build(parsed, cfg)
	if err != nil {
		return err
	}

	if err := writeOutputJSON(cfg.OutputFile, built.Output); err != nil {
		return err
	}

	if cfg.ReportFile != "" {
		audit := normalizeAudit{
			Command:                "normalize",
			InputFile:              cfg.InputFile,
			OutputFile:             cfg.OutputFile,
			InputShape:             string(parsed.Shape),
			InputSegmentCount:      len(parsed.Segments),
			OutputSchema:           cfg.OutputSchema,
			OutputModules:          append([]string(nil), cfg.OutputModules...),
			IDsReassigned:          built.IDsReassigned,
			SortingChangedInput:    built.SortingChanged,
			SegmentsWithCategories: built.SegmentsWithCategories,
		}
		auditJSON, err := json.Marshal(audit)
		if err != nil {
			return fmt.Errorf("marshal normalize audit: %w", err)
		}

		events := []report.Event{
			report.Info("normalize", "normalize", "started normalize command"),
			report.Info("normalize", "normalize", fmt.Sprintf("input file: %s", cfg.InputFile)),
			report.Info("normalize", "normalize", fmt.Sprintf("detected input shape: %s", parsed.Shape)),
			report.Info("normalize", "normalize", fmt.Sprintf("input segment count: %d", len(parsed.Segments))),
			report.Info("normalize", "normalize", fmt.Sprintf("selected output schema: %s", cfg.OutputSchema)),
			report.Info("normalize", "normalize", fmt.Sprintf("selected output modules: %s", strings.Join(cfg.OutputModules, ","))),
			report.Info("normalize", "normalize", fmt.Sprintf("output file: %s", cfg.OutputFile)),
			report.Info("normalize", "normalize", fmt.Sprintf("ids reassigned: %t", built.IDsReassigned)),
			report.Info("normalize", "normalize", fmt.Sprintf("sorting changed input order: %t", built.SortingChanged)),
			report.Info("normalize", "normalize", fmt.Sprintf("segments with categories: %d", built.SegmentsWithCategories)),
			report.Info("normalize", "normalize-audit", string(auditJSON)),
		}
		if len(parsed.Segments) == 0 {
			events = append(events, report.Warning("normalize", "normalize", "input transcript contains zero segments"))
		}
		events = append(events,
			report.Info("normalize", "validate-output", fmt.Sprintf("validated %d output segment(s)", len(parsed.Segments))),
			report.Info("output", "json", "wrote transcript JSON"),
		)

		rpt := report.Report{
			Metadata: report.Metadata{
				Application:           artifact.ApplicationName,
				Version:               buildinfo.Version,
				InputReader:           "normalize-input",
				InputFiles:            []string{cfg.InputFile},
				PreprocessingModules:  []string{},
				PostprocessingModules: []string{},
				OutputModules:         append([]string(nil), cfg.OutputModules...),
			},
			Events: events,
		}
		if err := report.WriteJSON(cfg.ReportFile, rpt); err != nil {
			return fmt.Errorf("write --report-file %q: %w", cfg.ReportFile, err)
		}
	}

	return nil
}

func writeOutputJSON(path string, value any) error {
	file, err := os.Create(path)
	if err != nil {
		return err
	}
	defer file.Close()

	encoder := json.NewEncoder(file)
	encoder.SetIndent("", "  ")
	if err := encoder.Encode(value); err != nil {
		return fmt.Errorf("encode normalize output JSON: %w", err)
	}
	return nil
}
197 internal/normalize/parse.go Normal file
@@ -0,0 +1,197 @@
package normalize

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// InputShape identifies which top-level input shape was parsed.
type InputShape string

const (
	ShapeObjectWithSegments InputShape = "object_with_segments"
	ShapeBareSegmentsArray  InputShape = "bare_segments_array"
)

// ParsedTranscript is the validated normalize input model.
type ParsedTranscript struct {
	Shape    InputShape
	Segments []InputSegment
}

// InputSegment is a validated segment from normalize input.
type InputSegment struct {
	InputIndex         int
	OriginalID         *int
	Start              float64
	End                float64
	Speaker            string
	Text               string
	Categories         []string
	Source             string
	SourceSegmentIndex *int
	SourceRef          string
	DerivedFrom        []string
	OverlapGroupID     *int
}

type inputSegmentPayload struct {
	ID                 *int     `json:"id"`
	Start              *float64 `json:"start"`
	End                *float64 `json:"end"`
	Speaker            *string  `json:"speaker"`
	Text               *string  `json:"text"`
	Categories         []string `json:"categories"`
	Source             string   `json:"source"`
	SourceSegmentIndex *int     `json:"source_segment_index"`
	SourceRef          string   `json:"source_ref"`
	DerivedFrom        []string `json:"derived_from"`
	OverlapGroupID     *int     `json:"overlap_group_id"`
}

// ParseFile parses normalize input JSON from file path.
func ParseFile(path string) (ParsedTranscript, error) {
	file, err := os.Open(path)
	if err != nil {
		return ParsedTranscript{}, err
	}
	defer file.Close()

	return ParseReader(file)
}

// ParseReader parses normalize input JSON from a reader.
func ParseReader(reader io.Reader) (ParsedTranscript, error) {
	var raw json.RawMessage
	decoder := json.NewDecoder(reader)
	decoder.UseNumber()
	if err := decoder.Decode(&raw); err != nil {
		return ParsedTranscript{}, fmt.Errorf("decode normalize input JSON: %w", err)
	}
	if err := ensureSingleValue(decoder); err != nil {
		return ParsedTranscript{}, err
	}

	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) == 0 {
		return ParsedTranscript{}, fmt.Errorf("normalize input is empty")
	}

	switch trimmed[0] {
	case '{':
		return parseObjectShape(trimmed)
	case '[':
		segments, err := parseSegmentsArray(trimmed)
		if err != nil {
			return ParsedTranscript{}, err
		}
		return ParsedTranscript{
			Shape:    ShapeBareSegmentsArray,
			Segments: segments,
		}, nil
	default:
		return ParsedTranscript{}, fmt.Errorf("normalize input must be a top-level object with \"segments\" or a top-level segment array")
	}
}

func ensureSingleValue(decoder *json.Decoder) error {
	var extra json.RawMessage
	err := decoder.Decode(&extra)
	if err == io.EOF {
		return nil
	}
	if err == nil {
		return fmt.Errorf("normalize input must contain exactly one top-level JSON value")
	}
	return fmt.Errorf("decode normalize input JSON: %w", err)
}

func parseObjectShape(raw []byte) (ParsedTranscript, error) {
	var object map[string]json.RawMessage
	if err := json.Unmarshal(raw, &object); err != nil {
		return ParsedTranscript{}, fmt.Errorf("decode normalize object input: %w", err)
	}

	segmentsRaw, exists := object["segments"]
	if !exists {
		return ParsedTranscript{}, fmt.Errorf("normalize object input must contain a \"segments\" field")
	}

	segments, err := parseSegmentsArray(segmentsRaw)
	if err != nil {
		return ParsedTranscript{}, err
	}

	return ParsedTranscript{
		Shape:    ShapeObjectWithSegments,
		Segments: segments,
	}, nil
}

func parseSegmentsArray(raw []byte) ([]InputSegment, error) {
	var segmentValues []json.RawMessage
	if err := json.Unmarshal(raw, &segmentValues); err != nil {
		return nil, fmt.Errorf("normalize input \"segments\" must be an array")
	}

	segments := make([]InputSegment, len(segmentValues))
	for index, segmentRaw := range segmentValues {
		segment, err := parseSegment(index, segmentRaw)
		if err != nil {
			return nil, err
		}
		segments[index] = segment
	}
	return segments, nil
}

func parseSegment(index int, raw []byte) (InputSegment, error) {
	var payload inputSegmentPayload
	if err := json.Unmarshal(raw, &payload); err != nil {
		return InputSegment{}, fmt.Errorf("segment %d: invalid segment object: %w", index, err)
	}

	if payload.Start == nil {
		return InputSegment{}, fmt.Errorf("segment %d is missing required field \"start\"", index)
	}
	if payload.End == nil {
		return InputSegment{}, fmt.Errorf("segment %d is missing required field \"end\"", index)
	}
	if payload.Speaker == nil {
		return InputSegment{}, fmt.Errorf("segment %d is missing required field \"speaker\"", index)
	}
	if payload.Text == nil {
		return InputSegment{}, fmt.Errorf("segment %d is missing required field \"text\"", index)
	}

	if *payload.Start < 0 {
		return InputSegment{}, fmt.Errorf("segment %d has start %v; start must be >= 0", index, *payload.Start)
	}
	if *payload.End < *payload.Start {
		return InputSegment{}, fmt.Errorf("segment %d has end %v before start %v", index, *payload.End, *payload.Start)
	}

	speaker := strings.TrimSpace(*payload.Speaker)
	if speaker == "" {
		return InputSegment{}, fmt.Errorf("segment %d has empty \"speaker\"; speaker must be non-empty", index)
	}

	return InputSegment{
		InputIndex:         index,
		OriginalID:         payload.ID,
		Start:              *payload.Start,
		End:                *payload.End,
		Speaker:            speaker,
		Text:               *payload.Text,
		Categories:         append([]string(nil), payload.Categories...),
		Source:             payload.Source,
		SourceSegmentIndex: payload.SourceSegmentIndex,
		SourceRef:          payload.SourceRef,
		DerivedFrom:        append([]string(nil), payload.DerivedFrom...),
		OverlapGroupID:     payload.OverlapGroupID,
	}, nil
}
181
internal/normalize/parse_test.go
Normal file
181
internal/normalize/parse_test.go
Normal file
@@ -0,0 +1,181 @@
package normalize

import (
	"strings"
	"testing"
)

func TestParseReaderObjectWithSegmentsParses(t *testing.T) {
	input := `{
		"segments": [
			{"start": 1.0, "end": 2.0, "speaker": " Alice ", "text": "hello", "id": 100}
		]
	}`

	parsed, err := ParseReader(strings.NewReader(input))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	if parsed.Shape != ShapeObjectWithSegments {
		t.Fatalf("shape = %q, want %q", parsed.Shape, ShapeObjectWithSegments)
	}
	if len(parsed.Segments) != 1 {
		t.Fatalf("segment count = %d, want 1", len(parsed.Segments))
	}
	segment := parsed.Segments[0]
	if segment.Speaker != "Alice" {
		t.Fatalf("speaker = %q, want %q", segment.Speaker, "Alice")
	}
	if segment.OriginalID == nil || *segment.OriginalID != 100 {
		t.Fatalf("original id = %v, want 100", segment.OriginalID)
	}
}

func TestParseReaderBareSegmentArrayParses(t *testing.T) {
	input := `[
		{"start": 1.0, "end": 2.0, "speaker": "Alice", "text": "hello"},
		{"start": 3.0, "end": 4.0, "speaker": "Bob", "text": "world"}
	]`

	parsed, err := ParseReader(strings.NewReader(input))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	if parsed.Shape != ShapeBareSegmentsArray {
		t.Fatalf("shape = %q, want %q", parsed.Shape, ShapeBareSegmentsArray)
	}
	if len(parsed.Segments) != 2 {
		t.Fatalf("segment count = %d, want 2", len(parsed.Segments))
	}
}

func TestParseReaderInvalidJSONFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`{"segments":`))
	if err == nil {
		t.Fatal("expected parse error")
	}
	if !strings.Contains(err.Error(), "decode normalize input JSON") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestParseReaderObjectMissingSegmentsFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`{"items":[]}`))
	if err == nil {
		t.Fatal("expected missing segments error")
	}
	if !strings.Contains(err.Error(), "must contain a \"segments\" field") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestParseReaderSegmentsNotArrayFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`{"segments": {}}`))
	if err == nil {
		t.Fatal("expected segments not array error")
	}
	if !strings.Contains(err.Error(), "\"segments\" must be an array") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestParseReaderTopLevelScalarShapesFail(t *testing.T) {
	tests := []string{`"text"`, `42`, `null`, `true`}
	for _, input := range tests {
		_, err := ParseReader(strings.NewReader(input))
		if err == nil {
			t.Fatalf("expected top-level shape error for %s", input)
		}
		if !strings.Contains(err.Error(), "top-level object") {
			t.Fatalf("unexpected error for %s: %v", input, err)
		}
	}
}

func TestParseReaderMissingStartFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"end":2,"speaker":"A","text":"t"}]`))
	assertContains(t, err, `missing required field "start"`)
}

func TestParseReaderMissingEndFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":1,"speaker":"A","text":"t"}]`))
	assertContains(t, err, `missing required field "end"`)
}

func TestParseReaderMissingSpeakerFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":1,"end":2,"text":"t"}]`))
	assertContains(t, err, `missing required field "speaker"`)
}

func TestParseReaderEmptySpeakerFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":1,"end":2,"speaker":" ","text":"t"}]`))
	assertContains(t, err, `speaker must be non-empty`)
}

func TestParseReaderMissingTextFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":1,"end":2,"speaker":"A"}]`))
	assertContains(t, err, `missing required field "text"`)
}

func TestParseReaderEndBeforeStartFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":3,"end":2,"speaker":"A","text":"t"}]`))
	assertContains(t, err, "before start")
}

func TestParseReaderNegativeStartFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":-1,"end":2,"speaker":"A","text":"t"}]`))
	assertContains(t, err, "start must be >= 0")
}

func TestParseReaderEmptySegmentsArrayAccepted(t *testing.T) {
	parsed, err := ParseReader(strings.NewReader(`{"segments":[]}`))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	if len(parsed.Segments) != 0 {
		t.Fatalf("segment count = %d, want 0", len(parsed.Segments))
	}
}

func TestParseReaderCategoriesPreservedWhenValid(t *testing.T) {
	parsed, err := ParseReader(strings.NewReader(`[{"start":1,"end":2,"speaker":"A","text":"t","categories":["filler","backchannel"]}]`))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	if len(parsed.Segments) != 1 {
		t.Fatalf("segment count = %d, want 1", len(parsed.Segments))
	}
	if len(parsed.Segments[0].Categories) != 2 {
		t.Fatalf("categories length = %d, want 2", len(parsed.Segments[0].Categories))
	}
	if parsed.Segments[0].Categories[0] != "filler" || parsed.Segments[0].Categories[1] != "backchannel" {
		t.Fatalf("categories = %v", parsed.Segments[0].Categories)
	}
}

func TestParseReaderOriginalInputIndexPreserved(t *testing.T) {
	input := `[
		{"start":1,"end":2,"speaker":"A","text":"one"},
		{"start":2,"end":3,"speaker":"B","text":"two"},
		{"start":3,"end":4,"speaker":"C","text":"three"}
	]`
	parsed, err := ParseReader(strings.NewReader(input))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	for index, segment := range parsed.Segments {
		if segment.InputIndex != index {
			t.Fatalf("segment %d input index = %d, want %d", index, segment.InputIndex, index)
		}
	}
}

func assertContains(t *testing.T, err error, fragment string) {
	t.Helper()
	if err == nil {
		t.Fatalf("expected error containing %q", fragment)
	}
	if !strings.Contains(err.Error(), fragment) {
		t.Fatalf("error = %q, want substring %q", err.Error(), fragment)
	}
}