Compare commits

6 commits

| SHA1 |
| --- |
| 6dbb7ab17e |
| 3591041fa8 |
| 5b008e272c |
| 6c780f6293 |
| c132f3fd5d |
| 3679435063 |

README.md (60)
@@ -1,8 +1,8 @@
 # seriatim

-`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID.
+`seriatim` merges per-speaker WhisperX-style JSON transcripts into a single JSON transcript that preserves speaker identity and chronological order. It also trims existing seriatim output artifacts by segment ID and normalizes external transcript-like JSON into standard seriatim output schemas.

-The current implementation supports the `merge` and `trim` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset.
+The current implementation supports the `merge`, `trim`, and `normalize` commands. `merge` reads one or more input JSON files, optionally maps each input file to a canonical speaker using `speakers.yml`, sorts all segments by timestamp, detects and resolves overlaps when word-level timing is available, assigns consecutive numeric `id` values, and writes a merged JSON artifact. `trim` reads an existing seriatim output artifact and projects it to a retained segment subset. `normalize` reads transcript-like JSON input, validates required segment fields, sorts deterministically, assigns fresh IDs, and emits a selected seriatim output schema.

 ## Usage
@@ -34,11 +34,30 @@ go run ./cmd/seriatim trim \
   --keep "1-10, 15, 20-25"
 ```

+Normalize external transcript-style JSON:
+
+```sh
+go run ./cmd/seriatim normalize \
+  --input-file transcript.json \
+  --output-file normalized.json
+```
+
+Normalize an Audita-style bare segment array to full schema with report output:
+
+```sh
+go run ./cmd/seriatim normalize \
+  --input-file audita-segments.json \
+  --output-file normalized-full.json \
+  --output-schema seriatim-full \
+  --report-file normalize-report.json
+```
+
 ## CLI

 ```text
 seriatim merge [flags]
 seriatim trim [flags]
+seriatim normalize [flags]
 ```

 Global flags:
@@ -108,6 +127,43 @@ Global flags:
 - The report includes a `trim-audit` event containing trim operation metadata, including selected IDs, retained/removed counts, removed IDs, and old-to-new segment ID mapping.
 - Old-to-new ID mapping is emitted as a deterministic ordered array of `{old_id, new_id}` pairs.

+`normalize` flags:
+
+| Flag | Required | Default | Description |
+| --- | --- | --- | --- |
+| `--input-file` | Yes | none | Input transcript JSON file. |
+| `--output-file` | Yes | none | Normalized transcript JSON output path. |
+| `--output-schema` | No | `seriatim-intermediate` (resolved via `SERIATIM_OUTPUT_SCHEMA` when set) | Output JSON schema: `seriatim-minimal`, `seriatim-intermediate`, or `seriatim-full`. |
+| `--output-modules` | No | `json` | Comma-separated output modules. Current normalize support is `json` only. |
+| `--report-file` | No | none | Optional report JSON output path. |
+
+`normalize` input shapes:
+
+- Top-level object with a `segments` array.
+- Bare top-level array of segment objects (for example, Audita-style output).
+
+`normalize` required segment fields:
+
+- `start`
+- `end`
+- `speaker`
+- `text`
+
+`normalize` behavior:
+
+- Validates `start >= 0`, `end >= start`, and non-empty `speaker`.
+- Accepts existing input `id` values as provenance only.
+- Reassigns output segment IDs sequentially from `1` to `N`.
+- Sorts deterministically by `(start, end, original_input_index, speaker)`.
+- Uses original input order only as a tie-breaker.
+- Does not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
+- Useful for converting external transcript outputs into standard seriatim artifacts.
+
+`normalize` report output:
+
+- When `--report-file` is provided, normalize emits deterministic report events with input shape detection, segment counts, schema/module selections, sorting/ID diagnostics, and output write/validation summaries.
+- A machine-readable `normalize-audit` event is included for downstream tooling.
+
 Environment variables:

 | Environment Variable | Default | Description |
@@ -3,7 +3,8 @@
 `seriatim` is a deterministic transcript utility for:

 - merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript, and
-- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming.
+- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming, and
+- canonicalizing external transcript-style JSON inputs into standard seriatim output schemas.

 The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.
@@ -60,7 +61,7 @@ configuration check

 Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.

-`merge` runs this pipeline. `trim` is intentionally separate from this pipeline and operates at the artifact layer.
+`merge` runs this pipeline. `trim` and `normalize` are intentionally separate from this pipeline and operate at the artifact layer.

 ## Stage Contracts
@@ -214,6 +215,24 @@ Design constraints:

 `trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.

+### 8. Artifact Canonicalization Stage (`normalize` command)
+
+`normalize` is an artifact-level command that reads transcript-like JSON and emits a standard seriatim output artifact in a selected schema.
+
+Design constraints:
+
+- `normalize` runs outside the merge pipeline and does not invoke merge preprocessing or postprocessing modules.
+- `normalize` accepts two input shapes: object-with-`segments` and bare segment arrays.
+- `normalize` validates required segment fields (`start`, `end`, `speaker`, `text`) and timing/speaker constraints.
+- `normalize` sorts segments deterministically by chronological keys and stable input-index tie-breakers.
+- `normalize` assigns fresh sequential output IDs (`1..N`) after sorting.
+- `normalize` validates final output against the selected schema before writing.
+- `normalize` writes optional deterministic report diagnostics when `--report-file` is requested.
+
+`normalize` is intended for canonicalizing external transcript outputs (including Audita-style bare arrays) into seriatim contracts, not for running merge-time language or overlap transformations.
+
+`normalize` must not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.
 ## Module Classification

 Modules should be classified by their contract and allowed effects.
@@ -442,6 +461,13 @@ Trim-specific determinism requirements:

 - Old-to-new ID mapping in trim reports is emitted in deterministic order.
 - Full-schema overlap recomputation is deterministic for the same input artifact and selector.

+Normalize-specific determinism requirements:
+
+- Input-shape detection is deterministic.
+- Segment ordering is deterministic for identical input data.
+- Output IDs are always reassigned sequentially after deterministic sorting.
+- Normalize diagnostic reports are deterministic for identical inputs and configuration.
+
 ## Go Package Layout

 ```text
@@ -450,6 +476,7 @@ internal/config/    CLI/env/config loading and validation
 internal/pipeline/  Pipeline orchestration and module registry
 internal/builtin/   Built-in pipeline modules
 internal/artifact/  Conversion from internal model to public output schema
+internal/normalize/ Normalize input parsing, validation, deterministic sorting, schema conversion, and diagnostics
 internal/trim/      Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
 internal/buildinfo/ Build-time version metadata
 internal/speaker/   Speaker map parsing and lookup
@@ -468,6 +495,12 @@ For trim:

 - CLI command code handles only flag parsing, file I/O, and report emission.
 - Transform logic is deterministic and pure except for command-layer I/O.

+For normalize:
+
+- `internal/normalize` contains parsing/validation and deterministic schema conversion logic.
+- CLI command code handles flag parsing and delegates execution.
+- Normalize remains artifact-level and does not compose merge pipeline modules.
+
 ## Default Modules

 The default pipeline is equivalent to explicit module lists.
internal/cli/normalize.go (new file, 39 lines)

@@ -0,0 +1,39 @@
package cli

import (
	"github.com/spf13/cobra"

	"gitea.maximumdirect.net/eric/seriatim/internal/config"
	"gitea.maximumdirect.net/eric/seriatim/internal/normalize"
)

func newNormalizeCommand() *cobra.Command {
	var opts config.NormalizeOptions

	cmd := &cobra.Command{
		Use:   "normalize",
		Short: "Normalize a transcript artifact into a standard seriatim output shape",
		RunE: func(cmd *cobra.Command, args []string) error {
			normalizeOpts := opts
			if !cmd.Flags().Changed("output-schema") {
				normalizeOpts.OutputSchema = ""
			}

			cfg, err := config.NewNormalizeConfig(normalizeOpts)
			if err != nil {
				return err
			}

			return normalize.Run(cmd.Context(), cfg)
		},
	}

	flags := cmd.Flags()
	flags.StringVar(&opts.InputFile, "input-file", "", "input transcript JSON file")
	flags.StringVar(&opts.OutputFile, "output-file", "", "output transcript JSON file")
	flags.StringVar(&opts.ReportFile, "report-file", "", "optional report JSON file")
	flags.StringVar(&opts.OutputSchema, "output-schema", config.DefaultOutputSchema, "output JSON schema: seriatim-minimal, seriatim-intermediate, or seriatim-full")
	flags.StringVar(&opts.OutputModules, "output-modules", config.DefaultOutputModules, "comma-separated output modules")

	return cmd
}
internal/cli/normalize_test.go (new file, 457 lines)

@@ -0,0 +1,457 @@
package cli

import (
	"encoding/json"
	"os"
	"path/filepath"
	"strings"
	"testing"

	"gitea.maximumdirect.net/eric/seriatim/internal/config"
	"gitea.maximumdirect.net/eric/seriatim/internal/report"
	"gitea.maximumdirect.net/eric/seriatim/schema"
)

func TestNormalizeCommandIsRecognized(t *testing.T) {
	cmd := NewRootCommand()
	cmd.SetArgs([]string{"normalize", "--help"})
	if err := cmd.Execute(); err != nil {
		t.Fatalf("normalize command should be recognized: %v", err)
	}
}

func TestNormalizeMissingInputFileFails(t *testing.T) {
	dir := t.TempDir()
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--output-file", output,
	)
	if err == nil {
		t.Fatal("expected missing input-file error")
	}
	if !strings.Contains(err.Error(), "--input-file is required") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNormalizeMissingOutputFileFails(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)

	err := executeNormalize(
		"--input-file", input,
	)
	if err == nil {
		t.Fatal("expected missing output-file error")
	}
	if !strings.Contains(err.Error(), "--output-file is required") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNormalizeInvalidOutputSchemaFails(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", "compact",
	)
	if err == nil {
		t.Fatal("expected invalid output schema error")
	}
	if !strings.Contains(err.Error(), "--output-schema must be one of") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNormalizeInvalidOutputModuleFails(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-modules", "yaml",
	)
	if err == nil {
		t.Fatal("expected invalid output module error")
	}
	if !strings.Contains(err.Error(), "unknown output module") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNormalizeDefaultOutputSchemaIsIntermediate(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{
		"segments": [
			{"id": 99, "start": 5, "end": 6, "speaker": "Bob", "text": "second", "categories": ["filler"]},
			{"id": 10, "start": 1, "end": 2, "speaker": "Alice", "text": "first", "categories": ["backchannel"]}
		]
	}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.IntermediateTranscript
	readJSON(t, output, &transcript)
	if transcript.Metadata.OutputSchema != config.OutputSchemaIntermediate {
		t.Fatalf("output schema = %q, want %q", transcript.Metadata.OutputSchema, config.OutputSchemaIntermediate)
	}
	if len(transcript.Segments) != 2 {
		t.Fatalf("segment count = %d, want 2", len(transcript.Segments))
	}
	if transcript.Segments[0].ID != 1 || transcript.Segments[1].ID != 2 {
		t.Fatalf("segment IDs = %d,%d, want 1,2", transcript.Segments[0].ID, transcript.Segments[1].ID)
	}
	if transcript.Segments[0].Text != "first" || transcript.Segments[1].Text != "second" {
		t.Fatalf("unexpected sort order: %#v", transcript.Segments)
	}
	if len(transcript.Segments[0].Categories) != 1 || transcript.Segments[0].Categories[0] != "backchannel" {
		t.Fatalf("expected categories preserved on first segment, got %#v", transcript.Segments[0].Categories)
	}
}

func TestNormalizeBareArrayInputToIntermediateOutput(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `[
		{"start": 2, "end": 3, "speaker": "Bob", "text": "second"},
		{"start": 1, "end": 2, "speaker": "Alice", "text": "first"}
	]`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", config.OutputSchemaIntermediate,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.IntermediateTranscript
	readJSON(t, output, &transcript)
	if len(transcript.Segments) != 2 {
		t.Fatalf("segment count = %d, want 2", len(transcript.Segments))
	}
	if transcript.Segments[0].Speaker != "Alice" || transcript.Segments[1].Speaker != "Bob" {
		t.Fatalf("unexpected sorted speakers: %#v", transcript.Segments)
	}
}

func TestNormalizeInputIndexTieBreakerIsDeterministic(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `[
		{"start": 1, "end": 2, "speaker": "Zulu", "text": "first in"},
		{"start": 1, "end": 2, "speaker": "Alpha", "text": "second in"}
	]`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.IntermediateTranscript
	readJSON(t, output, &transcript)
	if transcript.Segments[0].Speaker != "Zulu" || transcript.Segments[1].Speaker != "Alpha" {
		t.Fatalf("tie-break order mismatch: %#v", transcript.Segments)
	}
}

func TestNormalizeMinimalSchemaOmitsCategories(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{
		"segments": [
			{"start": 1, "end": 2, "speaker": "Alice", "text": "first", "categories": ["filler"]}
		]
	}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", config.OutputSchemaMinimal,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.MinimalTranscript
	readJSON(t, output, &transcript)
	if transcript.Metadata.OutputSchema != config.OutputSchemaMinimal {
		t.Fatalf("output schema = %q, want %q", transcript.Metadata.OutputSchema, config.OutputSchemaMinimal)
	}
	if len(transcript.Segments) != 1 || transcript.Segments[0].ID != 1 {
		t.Fatalf("unexpected minimal output: %#v", transcript.Segments)
	}
	bytes, readErr := os.ReadFile(output)
	if readErr != nil {
		t.Fatalf("read output: %v", readErr)
	}
	if strings.Contains(string(bytes), "categories") {
		t.Fatalf("minimal output unexpectedly contains categories:\n%s", string(bytes))
	}
}

func TestNormalizeFullSchemaOutputValidatesAndHasProvenanceFallback(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `[
		{"start": 1, "end": 2, "speaker": "Alice", "text": "first"},
		{"start": 3, "end": 4, "speaker": "Bob", "text": "second", "source":"custom.json", "source_segment_index": 7}
	]`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", config.OutputSchemaFull,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.Transcript
	readJSON(t, output, &transcript)
	if err := schema.ValidateTranscript(transcript); err != nil {
		t.Fatalf("full output should validate: %v", err)
	}
	if len(transcript.Segments) != 2 {
		t.Fatalf("segment count = %d, want 2", len(transcript.Segments))
	}
	if transcript.Segments[0].Source != filepath.Base(input) {
		t.Fatalf("source fallback = %q, want %q", transcript.Segments[0].Source, filepath.Base(input))
	}
	if transcript.Segments[0].SourceSegmentIndex == nil || *transcript.Segments[0].SourceSegmentIndex != 0 {
		t.Fatalf("source_segment_index fallback = %v, want 0", transcript.Segments[0].SourceSegmentIndex)
	}
	if transcript.Segments[1].Source != "custom.json" {
		t.Fatalf("explicit source preserved = %q, want custom.json", transcript.Segments[1].Source)
	}
	if transcript.Segments[1].SourceSegmentIndex == nil || *transcript.Segments[1].SourceSegmentIndex != 7 {
		t.Fatalf("explicit source_segment_index preserved = %v, want 7", transcript.Segments[1].SourceSegmentIndex)
	}
	if transcript.OverlapGroups == nil || len(transcript.OverlapGroups) != 0 {
		t.Fatalf("overlap_groups = %#v, want empty array", transcript.OverlapGroups)
	}
}

func TestNormalizeEmptySegmentsArrayProducesValidOutput(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.IntermediateTranscript
	readJSON(t, output, &transcript)
	if len(transcript.Segments) != 0 {
		t.Fatalf("segment count = %d, want 0", len(transcript.Segments))
	}
	if err := schema.ValidateIntermediateTranscript(transcript); err != nil {
		t.Fatalf("intermediate output should validate: %v", err)
	}
}

func TestNormalizeSelectedOutputSchemaIsHonored(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[{"start":1,"end":2,"speaker":"A","text":"one"}]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--output-schema", config.OutputSchemaMinimal,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var transcript schema.MinimalTranscript
	readJSON(t, output, &transcript)
	if transcript.Metadata.OutputSchema != config.OutputSchemaMinimal {
		t.Fatalf("output schema = %q, want %q", transcript.Metadata.OutputSchema, config.OutputSchemaMinimal)
	}
}

func TestNormalizeReportFileWrittenAndContainsObjectInputShape(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[{"start":1,"end":2,"speaker":"A","text":"one"}]}`)
	output := filepath.Join(dir, "normalized.json")
	reportPath := filepath.Join(dir, "report.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", reportPath,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var rpt report.Report
	readJSON(t, reportPath, &rpt)
	audit := extractNormalizeAudit(t, rpt)
	if audit.InputShape != "object_with_segments" {
		t.Fatalf("input shape = %q, want object_with_segments", audit.InputShape)
	}
	if audit.InputSegmentCount != 1 {
		t.Fatalf("input segment count = %d, want 1", audit.InputSegmentCount)
	}
	if audit.OutputSchema != config.OutputSchemaIntermediate {
		t.Fatalf("output schema = %q, want %q", audit.OutputSchema, config.OutputSchemaIntermediate)
	}
	if len(audit.OutputModules) != 1 || audit.OutputModules[0] != "json" {
		t.Fatalf("output modules = %v, want [json]", audit.OutputModules)
	}
}

func TestNormalizeReportIncludesBareArrayShape(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `[{"start":1,"end":2,"speaker":"A","text":"one"}]`)
	output := filepath.Join(dir, "normalized.json")
	reportPath := filepath.Join(dir, "report.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", reportPath,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var rpt report.Report
	readJSON(t, reportPath, &rpt)
	audit := extractNormalizeAudit(t, rpt)
	if audit.InputShape != "bare_segments_array" {
		t.Fatalf("input shape = %q, want bare_segments_array", audit.InputShape)
	}
}

func TestNormalizeReportDoesNotIncludeTranscriptText(t *testing.T) {
	dir := t.TempDir()
	const segmentText = "normalize-report-secret-text"
	input := writeJSONFile(t, dir, "input.json", `[{"start":1,"end":2,"speaker":"A","text":"`+segmentText+`"}]`)
	output := filepath.Join(dir, "normalized.json")
	reportPath := filepath.Join(dir, "report.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", reportPath,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var rpt report.Report
	readJSON(t, reportPath, &rpt)
	for _, event := range rpt.Events {
		if strings.Contains(event.Message, segmentText) {
			t.Fatalf("report unexpectedly contained transcript text in event %#v", event)
		}
	}
}

func TestNormalizeReportEmptyInputEmitsWarning(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[]}`)
	output := filepath.Join(dir, "normalized.json")
	reportPath := filepath.Join(dir, "report.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", reportPath,
	)
	if err != nil {
		t.Fatalf("normalize failed: %v", err)
	}

	var rpt report.Report
	readJSON(t, reportPath, &rpt)
	found := false
	for _, event := range rpt.Events {
		if event.Stage == "normalize" && event.Module == "normalize" && event.Severity == report.SeverityWarning &&
			strings.Contains(event.Message, "zero segments") {
			found = true
			break
		}
	}
	if !found {
		t.Fatalf("expected empty transcript warning event, got %#v", rpt.Events)
	}
}

func TestNormalizeReportWriteFailureReturnsClearError(t *testing.T) {
	dir := t.TempDir()
	input := writeJSONFile(t, dir, "input.json", `{"segments":[{"start":1,"end":2,"speaker":"A","text":"one"}]}`)
	output := filepath.Join(dir, "normalized.json")

	err := executeNormalize(
		"--input-file", input,
		"--output-file", output,
		"--report-file", dir,
	)
	if err == nil {
		t.Fatal("expected report write failure")
	}
	if !strings.Contains(err.Error(), "write --report-file") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func executeNormalize(args ...string) error {
	cmd := NewRootCommand()
	cmd.SetArgs(append([]string{"normalize"}, args...))
	return cmd.Execute()
}

type normalizeAudit struct {
	Command                string   `json:"command"`
	InputFile              string   `json:"input_file"`
	OutputFile             string   `json:"output_file"`
	InputShape             string   `json:"input_shape"`
	InputSegmentCount      int      `json:"input_segment_count"`
	OutputSchema           string   `json:"output_schema"`
	OutputModules          []string `json:"output_modules"`
	IDsReassigned          bool     `json:"ids_reassigned"`
	SortingChangedInput    bool     `json:"sorting_changed_input_order"`
	SegmentsWithCategories int      `json:"segments_with_categories"`
}

func extractNormalizeAudit(t *testing.T, rpt report.Report) normalizeAudit {
	t.Helper()
	for _, event := range rpt.Events {
		if event.Stage == "normalize" && event.Module == "normalize-audit" {
			var audit normalizeAudit
			if err := json.Unmarshal([]byte(event.Message), &audit); err != nil {
				t.Fatalf("decode normalize audit: %v", err)
			}
			return audit
		}
	}
	t.Fatalf("missing normalize-audit event: %#v", rpt.Events)
	return normalizeAudit{}
}
@@ -10,13 +10,14 @@ import (
 func NewRootCommand() *cobra.Command {
 	cmd := &cobra.Command{
 		Use:           "seriatim",
-		Short:         "Merge per-speaker transcripts into a chronological transcript",
+		Short:         "Merge, trim, and normalize transcript artifacts",
 		Version:       buildinfo.Version,
 		SilenceErrors: true,
 		SilenceUsage:  true,
 	}

 	cmd.AddCommand(newMergeCommand())
+	cmd.AddCommand(newNormalizeCommand())
 	cmd.AddCommand(newTrimCommand())
 	return cmd
 }
@@ -58,6 +58,15 @@ type TrimOptions struct {
 	AllowEmpty bool
 }

+// NormalizeOptions captures raw CLI option values before validation.
+type NormalizeOptions struct {
+	InputFile     string
+	OutputFile    string
+	ReportFile    string
+	OutputSchema  string
+	OutputModules string
+}
+
 // Config is the validated runtime configuration for a merge invocation.
 type Config struct {
 	InputFiles []string
@@ -88,6 +97,15 @@ type TrimConfig struct {
 	AllowEmpty bool
 }

+// NormalizeConfig is the validated runtime configuration for a normalize invocation.
+type NormalizeConfig struct {
+	InputFile     string
+	OutputFile    string
+	ReportFile    string
+	OutputSchema  string
+	OutputModules []string
+}
+
 // NewMergeConfig validates raw merge options and returns normalized config.
 func NewMergeConfig(opts MergeOptions) (Config, error) {
 	cfg := Config{
@@ -247,6 +265,54 @@ func NewTrimConfig(opts TrimOptions) (TrimConfig, error) {
|
||||
}, nil
|
||||
}
|
||||
|
||||
// NewNormalizeConfig validates raw normalize options and returns normalized config.
|
||||
func NewNormalizeConfig(opts NormalizeOptions) (NormalizeConfig, error) {
|
||||
inputFile := filepath.Clean(strings.TrimSpace(opts.InputFile))
|
||||
if strings.TrimSpace(opts.InputFile) == "" {
|
||||
return NormalizeConfig{}, errors.New("--input-file is required")
|
||||
}
|
||||
if err := requireFile(inputFile, "--input-file"); err != nil {
|
||||
return NormalizeConfig{}, err
|
||||
}
|
||||
|
||||
outputFile, err := normalizeOutputPath(opts.OutputFile, "--output-file")
|
||||
if err != nil {
|
||||
return NormalizeConfig{}, err
|
||||
}
|
||||
|
||||
reportFile := ""
|
||||
if strings.TrimSpace(opts.ReportFile) != "" {
|
||||
reportFile, err = normalizeOutputPath(opts.ReportFile, "--report-file")
|
||||
if err != nil {
|
||||
return NormalizeConfig{}, err
|
||||
}
|
||||
}
|
||||
|
||||
outputSchema, err := resolveOutputSchema(opts.OutputSchema)
|
||||
if err != nil {
|
||||
return NormalizeConfig{}, err
|
||||
}
|
||||
|
||||
outputModules, err := parseModuleList(opts.OutputModules)
|
||||
if err != nil {
|
||||
return NormalizeConfig{}, fmt.Errorf("--output-modules: %w", err)
|
||||
}
|
||||
if len(outputModules) == 0 {
|
||||
return NormalizeConfig{}, errors.New("--output-modules must include at least one module")
|
||||
}
|
||||
if err := validateNormalizeOutputModules(outputModules); err != nil {
|
||||
return NormalizeConfig{}, err
|
||||
}
|
||||
|
||||
return NormalizeConfig{
|
||||
InputFile: inputFile,
|
||||
OutputFile: outputFile,
|
||||
ReportFile: reportFile,
|
||||
OutputSchema: outputSchema,
|
||||
OutputModules: outputModules,
|
||||
}, nil
|
||||
}
|
||||
|
||||
func parseModuleList(value string) ([]string, error) {
|
||||
value = strings.TrimSpace(value)
|
||||
if value == "" {
|
||||
@@ -400,3 +466,12 @@ func contains(values []string, target string) bool {
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func validateNormalizeOutputModules(modules []string) error {
|
||||
for _, module := range modules {
|
||||
if module != "json" {
|
||||
return fmt.Errorf("unknown output module %q", module)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
@@ -711,6 +711,107 @@ func TestNewTrimConfigRejectsInvalidOutputSchemaOverride(t *testing.T) {
	}
}

func TestNewNormalizeConfigRequiresInputFile(t *testing.T) {
	dir := t.TempDir()
	output := filepath.Join(dir, "normalized.json")

	_, err := NewNormalizeConfig(NormalizeOptions{
		OutputFile:    output,
		OutputModules: DefaultOutputModules,
	})
	if err == nil {
		t.Fatal("expected input-file required error")
	}
	if !strings.Contains(err.Error(), "--input-file is required") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNewNormalizeConfigRequiresOutputFile(t *testing.T) {
	dir := t.TempDir()
	input := writeTempFile(t, dir, "input.json")

	_, err := NewNormalizeConfig(NormalizeOptions{
		InputFile:     input,
		OutputModules: DefaultOutputModules,
	})
	if err == nil {
		t.Fatal("expected output-file required error")
	}
	if !strings.Contains(err.Error(), "--output-file is required") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNewNormalizeConfigResolvesOutputSchemaDefaultAndEnv(t *testing.T) {
	dir := t.TempDir()
	input := writeTempFile(t, dir, "input.json")
	output := filepath.Join(dir, "normalized.json")

	t.Setenv(OutputSchemaEnv, "")
	cfg, err := NewNormalizeConfig(NormalizeOptions{
		InputFile:     input,
		OutputFile:    output,
		OutputModules: DefaultOutputModules,
	})
	if err != nil {
		t.Fatalf("config failed: %v", err)
	}
	if cfg.OutputSchema != DefaultOutputSchema {
		t.Fatalf("output schema = %q, want %q", cfg.OutputSchema, DefaultOutputSchema)
	}

	t.Setenv(OutputSchemaEnv, OutputSchemaMinimal)
	cfg, err = NewNormalizeConfig(NormalizeOptions{
		InputFile:     input,
		OutputFile:    output,
		OutputModules: DefaultOutputModules,
	})
	if err != nil {
		t.Fatalf("config failed: %v", err)
	}
	if cfg.OutputSchema != OutputSchemaMinimal {
		t.Fatalf("output schema = %q, want %q", cfg.OutputSchema, OutputSchemaMinimal)
	}
}

func TestNewNormalizeConfigRejectsInvalidOutputSchema(t *testing.T) {
	dir := t.TempDir()
	input := writeTempFile(t, dir, "input.json")
	output := filepath.Join(dir, "normalized.json")

	_, err := NewNormalizeConfig(NormalizeOptions{
		InputFile:     input,
		OutputFile:    output,
		OutputSchema:  "compact",
		OutputModules: DefaultOutputModules,
	})
	if err == nil {
		t.Fatal("expected output schema error")
	}
	if !strings.Contains(err.Error(), "--output-schema must be one of") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestNewNormalizeConfigRejectsUnknownOutputModule(t *testing.T) {
	dir := t.TempDir()
	input := writeTempFile(t, dir, "input.json")
	output := filepath.Join(dir, "normalized.json")

	_, err := NewNormalizeConfig(NormalizeOptions{
		InputFile:     input,
		OutputFile:    output,
		OutputModules: "json,yaml",
	})
	if err == nil {
		t.Fatal("expected output module error")
	}
	if !strings.Contains(err.Error(), "unknown output module") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func assertPositiveFloatEnvValidation(t *testing.T, envName string) {
	t.Helper()
internal/normalize/build.go (new file, 216 lines)
@@ -0,0 +1,216 @@
package normalize

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"

	"gitea.maximumdirect.net/eric/seriatim/internal/artifact"
	"gitea.maximumdirect.net/eric/seriatim/internal/buildinfo"
	"gitea.maximumdirect.net/eric/seriatim/internal/config"
	"gitea.maximumdirect.net/eric/seriatim/schema"
)

// BuildResult contains normalize output plus deterministic transformation diagnostics.
type BuildResult struct {
	Output                 any
	SortingChanged         bool
	IDsReassigned          bool
	SegmentsWithCategories int
}

// Build converts parsed normalize input into a selected seriatim output schema.
func Build(parsed ParsedTranscript, cfg config.NormalizeConfig) (BuildResult, error) {
	ordered := sortedSegments(parsed.Segments)
	sortingChanged := didSortingChangeOrder(ordered)
	idsReassigned := didReassignIDs(ordered)
	segmentsWithCategories := countSegmentsWithCategories(ordered)

	switch cfg.OutputSchema {
	case config.OutputSchemaMinimal:
		output := buildMinimal(ordered)
		if err := schema.ValidateMinimalTranscript(output); err != nil {
			return BuildResult{}, fmt.Errorf("validate normalize output: %w", err)
		}
		return BuildResult{
			Output:                 output,
			SortingChanged:         sortingChanged,
			IDsReassigned:          idsReassigned,
			SegmentsWithCategories: segmentsWithCategories,
		}, nil
	case config.OutputSchemaIntermediate:
		output := buildIntermediate(ordered)
		if err := schema.ValidateIntermediateTranscript(output); err != nil {
			return BuildResult{}, fmt.Errorf("validate normalize output: %w", err)
		}
		return BuildResult{
			Output:                 output,
			SortingChanged:         sortingChanged,
			IDsReassigned:          idsReassigned,
			SegmentsWithCategories: segmentsWithCategories,
		}, nil
	case config.OutputSchemaFull:
		output := buildFull(ordered, cfg)
		if err := schema.ValidateTranscript(output); err != nil {
			return BuildResult{}, fmt.Errorf("validate normalize output: %w", err)
		}
		return BuildResult{
			Output:                 output,
			SortingChanged:         sortingChanged,
			IDsReassigned:          idsReassigned,
			SegmentsWithCategories: segmentsWithCategories,
		}, nil
	default:
		return BuildResult{}, fmt.Errorf("unsupported output schema %q", cfg.OutputSchema)
	}
}

func sortedSegments(input []InputSegment) []InputSegment {
	ordered := make([]InputSegment, len(input))
	copy(ordered, input)
	sort.SliceStable(ordered, func(i, j int) bool {
		left := ordered[i]
		right := ordered[j]
		if left.Start != right.Start {
			return left.Start < right.Start
		}
		if left.End != right.End {
			return left.End < right.End
		}
		if left.InputIndex != right.InputIndex {
			return left.InputIndex < right.InputIndex
		}
		return left.Speaker < right.Speaker
	})
	return ordered
}

func buildMinimal(segments []InputSegment) schema.MinimalTranscript {
	outputSegments := make([]schema.MinimalSegment, len(segments))
	for index, segment := range segments {
		outputSegments[index] = schema.MinimalSegment{
			ID:      index + 1,
			Start:   segment.Start,
			End:     segment.End,
			Speaker: segment.Speaker,
			Text:    segment.Text,
		}
	}

	return schema.MinimalTranscript{
		Metadata: schema.MinimalMetadata{
			Application:  artifact.ApplicationName,
			Version:      buildinfo.Version,
			OutputSchema: config.OutputSchemaMinimal,
		},
		Segments: outputSegments,
	}
}

func buildIntermediate(segments []InputSegment) schema.IntermediateTranscript {
	outputSegments := make([]schema.IntermediateSegment, len(segments))
	for index, segment := range segments {
		outputSegments[index] = schema.IntermediateSegment{
			ID:         index + 1,
			Start:      segment.Start,
			End:        segment.End,
			Speaker:    segment.Speaker,
			Text:       segment.Text,
			Categories: append([]string(nil), segment.Categories...),
		}
	}

	return schema.IntermediateTranscript{
		Metadata: schema.IntermediateMetadata{
			Application:  artifact.ApplicationName,
			Version:      buildinfo.Version,
			OutputSchema: config.OutputSchemaIntermediate,
		},
		Segments: outputSegments,
	}
}

func buildFull(segments []InputSegment, cfg config.NormalizeConfig) schema.Transcript {
	defaultSource := filepath.Base(cfg.InputFile)
	outputSegments := make([]schema.Segment, len(segments))
	for index, segment := range segments {
		source := strings.TrimSpace(segment.Source)
		if source == "" {
			source = defaultSource
		}

		sourceSegmentIndex := copyIntPtr(segment.SourceSegmentIndex)
		if sourceSegmentIndex == nil {
			fallback := segment.InputIndex
			sourceSegmentIndex = &fallback
		}

		outputSegments[index] = schema.Segment{
			ID:                 index + 1,
			Source:             source,
			SourceSegmentIndex: sourceSegmentIndex,
			SourceRef:          segment.SourceRef,
			DerivedFrom:        append([]string(nil), segment.DerivedFrom...),
			Speaker:            segment.Speaker,
			Start:              segment.Start,
			End:                segment.End,
			Text:               segment.Text,
			Categories:         append([]string(nil), segment.Categories...),
		}
	}

	return schema.Transcript{
		Metadata: schema.Metadata{
			Application:           artifact.ApplicationName,
			Version:               buildinfo.Version,
			InputReader:           "normalize-input",
			InputFiles:            []string{cfg.InputFile},
			PreprocessingModules:  []string{},
			PostprocessingModules: []string{},
			OutputModules:         append([]string(nil), cfg.OutputModules...),
		},
		Segments:      outputSegments,
		OverlapGroups: []schema.OverlapGroup{},
	}
}

func copyIntPtr(value *int) *int {
	if value == nil {
		return nil
	}
	copied := *value
	return &copied
}

func didSortingChangeOrder(segments []InputSegment) bool {
	for index, segment := range segments {
		if segment.InputIndex != index {
			return true
		}
	}
	return false
}

func didReassignIDs(segments []InputSegment) bool {
	if len(segments) == 0 {
		return false
	}
	for index, segment := range segments {
		newID := index + 1
		if segment.OriginalID == nil || *segment.OriginalID != newID {
			return true
		}
	}
	return false
}

func countSegmentsWithCategories(segments []InputSegment) int {
	count := 0
	for _, segment := range segments {
		if len(segment.Categories) > 0 {
			count++
		}
	}
	return count
}
internal/normalize/normalize.go (new file, 121 lines)
@@ -0,0 +1,121 @@
package normalize

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"strings"

	"gitea.maximumdirect.net/eric/seriatim/internal/artifact"
	"gitea.maximumdirect.net/eric/seriatim/internal/buildinfo"
	"gitea.maximumdirect.net/eric/seriatim/internal/config"
	"gitea.maximumdirect.net/eric/seriatim/internal/report"
)

type normalizeAudit struct {
	Command                string   `json:"command"`
	InputFile              string   `json:"input_file"`
	OutputFile             string   `json:"output_file"`
	InputShape             string   `json:"input_shape"`
	InputSegmentCount      int      `json:"input_segment_count"`
	OutputSchema           string   `json:"output_schema"`
	OutputModules          []string `json:"output_modules"`
	IDsReassigned          bool     `json:"ids_reassigned"`
	SortingChangedInput    bool     `json:"sorting_changed_input_order"`
	SegmentsWithCategories int      `json:"segments_with_categories"`
}

// Run executes artifact-level normalization.
func Run(ctx context.Context, cfg config.NormalizeConfig) error {
	if err := ctx.Err(); err != nil {
		return err
	}

	parsed, err := ParseFile(cfg.InputFile)
	if err != nil {
		return err
	}

	built, err := Build(parsed, cfg)
	if err != nil {
		return err
	}

	if err := writeOutputJSON(cfg.OutputFile, built.Output); err != nil {
		return err
	}

	if cfg.ReportFile != "" {
		audit := normalizeAudit{
			Command:                "normalize",
			InputFile:              cfg.InputFile,
			OutputFile:             cfg.OutputFile,
			InputShape:             string(parsed.Shape),
			InputSegmentCount:      len(parsed.Segments),
			OutputSchema:           cfg.OutputSchema,
			OutputModules:          append([]string(nil), cfg.OutputModules...),
			IDsReassigned:          built.IDsReassigned,
			SortingChangedInput:    built.SortingChanged,
			SegmentsWithCategories: built.SegmentsWithCategories,
		}
		auditJSON, err := json.Marshal(audit)
		if err != nil {
			return fmt.Errorf("marshal normalize audit: %w", err)
		}

		events := []report.Event{
			report.Info("normalize", "normalize", "started normalize command"),
			report.Info("normalize", "normalize", fmt.Sprintf("input file: %s", cfg.InputFile)),
			report.Info("normalize", "normalize", fmt.Sprintf("detected input shape: %s", parsed.Shape)),
			report.Info("normalize", "normalize", fmt.Sprintf("input segment count: %d", len(parsed.Segments))),
			report.Info("normalize", "normalize", fmt.Sprintf("selected output schema: %s", cfg.OutputSchema)),
			report.Info("normalize", "normalize", fmt.Sprintf("selected output modules: %s", strings.Join(cfg.OutputModules, ","))),
			report.Info("normalize", "normalize", fmt.Sprintf("output file: %s", cfg.OutputFile)),
			report.Info("normalize", "normalize", fmt.Sprintf("ids reassigned: %t", built.IDsReassigned)),
			report.Info("normalize", "normalize", fmt.Sprintf("sorting changed input order: %t", built.SortingChanged)),
			report.Info("normalize", "normalize", fmt.Sprintf("segments with categories: %d", built.SegmentsWithCategories)),
			report.Info("normalize", "normalize-audit", string(auditJSON)),
		}
		if len(parsed.Segments) == 0 {
			events = append(events, report.Warning("normalize", "normalize", "input transcript contains zero segments"))
		}
		events = append(events,
			report.Info("normalize", "validate-output", fmt.Sprintf("validated %d output segment(s)", len(parsed.Segments))),
			report.Info("output", "json", "wrote transcript JSON"),
		)

		rpt := report.Report{
			Metadata: report.Metadata{
				Application:           artifact.ApplicationName,
				Version:               buildinfo.Version,
				InputReader:           "normalize-input",
				InputFiles:            []string{cfg.InputFile},
				PreprocessingModules:  []string{},
				PostprocessingModules: []string{},
				OutputModules:         append([]string(nil), cfg.OutputModules...),
			},
			Events: events,
		}
		if err := report.WriteJSON(cfg.ReportFile, rpt); err != nil {
			return fmt.Errorf("write --report-file %q: %w", cfg.ReportFile, err)
		}
	}

	return nil
}

func writeOutputJSON(path string, value any) error {
	file, err := os.Create(path)
	if err != nil {
		return err
	}
	defer file.Close()

	encoder := json.NewEncoder(file)
	encoder.SetIndent("", "  ")
	if err := encoder.Encode(value); err != nil {
		return fmt.Errorf("encode normalize output JSON: %w", err)
	}
	return nil
}
internal/normalize/parse.go (new file, 197 lines)
@@ -0,0 +1,197 @@
package normalize

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// InputShape identifies which top-level input shape was parsed.
type InputShape string

const (
	ShapeObjectWithSegments InputShape = "object_with_segments"
	ShapeBareSegmentsArray  InputShape = "bare_segments_array"
)

// ParsedTranscript is the validated normalize input model.
type ParsedTranscript struct {
	Shape    InputShape
	Segments []InputSegment
}

// InputSegment is a validated segment from normalize input.
type InputSegment struct {
	InputIndex         int
	OriginalID         *int
	Start              float64
	End                float64
	Speaker            string
	Text               string
	Categories         []string
	Source             string
	SourceSegmentIndex *int
	SourceRef          string
	DerivedFrom        []string
	OverlapGroupID     *int
}

type inputSegmentPayload struct {
	ID                 *int     `json:"id"`
	Start              *float64 `json:"start"`
	End                *float64 `json:"end"`
	Speaker            *string  `json:"speaker"`
	Text               *string  `json:"text"`
	Categories         []string `json:"categories"`
	Source             string   `json:"source"`
	SourceSegmentIndex *int     `json:"source_segment_index"`
	SourceRef          string   `json:"source_ref"`
	DerivedFrom        []string `json:"derived_from"`
	OverlapGroupID     *int     `json:"overlap_group_id"`
}

// ParseFile parses normalize input JSON from a file path.
func ParseFile(path string) (ParsedTranscript, error) {
	file, err := os.Open(path)
	if err != nil {
		return ParsedTranscript{}, err
	}
	defer file.Close()

	return ParseReader(file)
}

// ParseReader parses normalize input JSON from a reader.
func ParseReader(reader io.Reader) (ParsedTranscript, error) {
	var raw json.RawMessage
	decoder := json.NewDecoder(reader)
	decoder.UseNumber()
	if err := decoder.Decode(&raw); err != nil {
		return ParsedTranscript{}, fmt.Errorf("decode normalize input JSON: %w", err)
	}
	if err := ensureSingleValue(decoder); err != nil {
		return ParsedTranscript{}, err
	}

	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) == 0 {
		return ParsedTranscript{}, fmt.Errorf("normalize input is empty")
	}

	switch trimmed[0] {
	case '{':
		return parseObjectShape(trimmed)
	case '[':
		segments, err := parseSegmentsArray(trimmed)
		if err != nil {
			return ParsedTranscript{}, err
		}
		return ParsedTranscript{
			Shape:    ShapeBareSegmentsArray,
			Segments: segments,
		}, nil
	default:
		return ParsedTranscript{}, fmt.Errorf("normalize input must be a top-level object with \"segments\" or a top-level segment array")
	}
}

func ensureSingleValue(decoder *json.Decoder) error {
	var extra json.RawMessage
	err := decoder.Decode(&extra)
	if err == io.EOF {
		return nil
	}
	if err == nil {
		return fmt.Errorf("normalize input must contain exactly one top-level JSON value")
	}
	return fmt.Errorf("decode normalize input JSON: %w", err)
}
func parseObjectShape(raw []byte) (ParsedTranscript, error) {
	var object map[string]json.RawMessage
	if err := json.Unmarshal(raw, &object); err != nil {
		return ParsedTranscript{}, fmt.Errorf("decode normalize object input: %w", err)
	}

	segmentsRaw, exists := object["segments"]
	if !exists {
		return ParsedTranscript{}, fmt.Errorf("normalize object input must contain a \"segments\" field")
	}

	segments, err := parseSegmentsArray(segmentsRaw)
	if err != nil {
		return ParsedTranscript{}, err
	}

	return ParsedTranscript{
		Shape:    ShapeObjectWithSegments,
		Segments: segments,
	}, nil
}

func parseSegmentsArray(raw []byte) ([]InputSegment, error) {
	var segmentValues []json.RawMessage
	if err := json.Unmarshal(raw, &segmentValues); err != nil {
		return nil, fmt.Errorf("normalize input \"segments\" must be an array")
	}

	segments := make([]InputSegment, len(segmentValues))
	for index, segmentRaw := range segmentValues {
		segment, err := parseSegment(index, segmentRaw)
		if err != nil {
			return nil, err
		}
		segments[index] = segment
	}
	return segments, nil
}

func parseSegment(index int, raw []byte) (InputSegment, error) {
	var payload inputSegmentPayload
	if err := json.Unmarshal(raw, &payload); err != nil {
		return InputSegment{}, fmt.Errorf("segment %d: invalid segment object: %w", index, err)
	}

	if payload.Start == nil {
		return InputSegment{}, fmt.Errorf("segment %d is missing required field \"start\"", index)
	}
	if payload.End == nil {
		return InputSegment{}, fmt.Errorf("segment %d is missing required field \"end\"", index)
	}
	if payload.Speaker == nil {
		return InputSegment{}, fmt.Errorf("segment %d is missing required field \"speaker\"", index)
	}
	if payload.Text == nil {
		return InputSegment{}, fmt.Errorf("segment %d is missing required field \"text\"", index)
	}

	if *payload.Start < 0 {
		return InputSegment{}, fmt.Errorf("segment %d has start %v; start must be >= 0", index, *payload.Start)
	}
	if *payload.End < *payload.Start {
		return InputSegment{}, fmt.Errorf("segment %d has end %v before start %v", index, *payload.End, *payload.Start)
	}

	speaker := strings.TrimSpace(*payload.Speaker)
	if speaker == "" {
		return InputSegment{}, fmt.Errorf("segment %d has empty \"speaker\"; speaker must be non-empty", index)
	}

	return InputSegment{
		InputIndex:         index,
		OriginalID:         payload.ID,
		Start:              *payload.Start,
		End:                *payload.End,
		Speaker:            speaker,
		Text:               *payload.Text,
		Categories:         append([]string(nil), payload.Categories...),
		Source:             payload.Source,
		SourceSegmentIndex: payload.SourceSegmentIndex,
		SourceRef:          payload.SourceRef,
		DerivedFrom:        append([]string(nil), payload.DerivedFrom...),
		OverlapGroupID:     payload.OverlapGroupID,
	}, nil
}
internal/normalize/parse_test.go (new file, 181 lines)
@@ -0,0 +1,181 @@
package normalize

import (
	"strings"
	"testing"
)

func TestParseReaderObjectWithSegmentsParses(t *testing.T) {
	input := `{
		"segments": [
			{"start": 1.0, "end": 2.0, "speaker": " Alice ", "text": "hello", "id": 100}
		]
	}`

	parsed, err := ParseReader(strings.NewReader(input))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	if parsed.Shape != ShapeObjectWithSegments {
		t.Fatalf("shape = %q, want %q", parsed.Shape, ShapeObjectWithSegments)
	}
	if len(parsed.Segments) != 1 {
		t.Fatalf("segment count = %d, want 1", len(parsed.Segments))
	}
	segment := parsed.Segments[0]
	if segment.Speaker != "Alice" {
		t.Fatalf("speaker = %q, want %q", segment.Speaker, "Alice")
	}
	if segment.OriginalID == nil || *segment.OriginalID != 100 {
		t.Fatalf("original id = %v, want 100", segment.OriginalID)
	}
}

func TestParseReaderBareSegmentArrayParses(t *testing.T) {
	input := `[
		{"start": 1.0, "end": 2.0, "speaker": "Alice", "text": "hello"},
		{"start": 3.0, "end": 4.0, "speaker": "Bob", "text": "world"}
	]`

	parsed, err := ParseReader(strings.NewReader(input))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	if parsed.Shape != ShapeBareSegmentsArray {
		t.Fatalf("shape = %q, want %q", parsed.Shape, ShapeBareSegmentsArray)
	}
	if len(parsed.Segments) != 2 {
		t.Fatalf("segment count = %d, want 2", len(parsed.Segments))
	}
}

func TestParseReaderInvalidJSONFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`{"segments":`))
	if err == nil {
		t.Fatal("expected parse error")
	}
	if !strings.Contains(err.Error(), "decode normalize input JSON") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestParseReaderObjectMissingSegmentsFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`{"items":[]}`))
	if err == nil {
		t.Fatal("expected missing segments error")
	}
	if !strings.Contains(err.Error(), "must contain a \"segments\" field") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestParseReaderSegmentsNotArrayFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`{"segments": {}}`))
	if err == nil {
		t.Fatal("expected segments not array error")
	}
	if !strings.Contains(err.Error(), "\"segments\" must be an array") {
		t.Fatalf("unexpected error: %v", err)
	}
}

func TestParseReaderTopLevelScalarShapesFail(t *testing.T) {
	tests := []string{`"text"`, `42`, `null`, `true`}
	for _, input := range tests {
		_, err := ParseReader(strings.NewReader(input))
		if err == nil {
			t.Fatalf("expected top-level shape error for %s", input)
		}
		if !strings.Contains(err.Error(), "top-level object") {
			t.Fatalf("unexpected error for %s: %v", input, err)
		}
	}
}

func TestParseReaderMissingStartFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"end":2,"speaker":"A","text":"t"}]`))
	assertContains(t, err, `missing required field "start"`)
}

func TestParseReaderMissingEndFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":1,"speaker":"A","text":"t"}]`))
	assertContains(t, err, `missing required field "end"`)
}

func TestParseReaderMissingSpeakerFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":1,"end":2,"text":"t"}]`))
	assertContains(t, err, `missing required field "speaker"`)
}

func TestParseReaderEmptySpeakerFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":1,"end":2,"speaker":" ","text":"t"}]`))
	assertContains(t, err, `speaker must be non-empty`)
}

func TestParseReaderMissingTextFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":1,"end":2,"speaker":"A"}]`))
	assertContains(t, err, `missing required field "text"`)
}

func TestParseReaderEndBeforeStartFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":3,"end":2,"speaker":"A","text":"t"}]`))
	assertContains(t, err, "before start")
}

func TestParseReaderNegativeStartFails(t *testing.T) {
	_, err := ParseReader(strings.NewReader(`[{"start":-1,"end":2,"speaker":"A","text":"t"}]`))
	assertContains(t, err, "start must be >= 0")
}

func TestParseReaderEmptySegmentsArrayAccepted(t *testing.T) {
	parsed, err := ParseReader(strings.NewReader(`{"segments":[]}`))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	if len(parsed.Segments) != 0 {
		t.Fatalf("segment count = %d, want 0", len(parsed.Segments))
	}
}

func TestParseReaderCategoriesPreservedWhenValid(t *testing.T) {
	parsed, err := ParseReader(strings.NewReader(`[{"start":1,"end":2,"speaker":"A","text":"t","categories":["filler","backchannel"]}]`))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	if len(parsed.Segments) != 1 {
		t.Fatalf("segment count = %d, want 1", len(parsed.Segments))
	}
	if len(parsed.Segments[0].Categories) != 2 {
		t.Fatalf("categories length = %d, want 2", len(parsed.Segments[0].Categories))
	}
	if parsed.Segments[0].Categories[0] != "filler" || parsed.Segments[0].Categories[1] != "backchannel" {
		t.Fatalf("categories = %v", parsed.Segments[0].Categories)
	}
}

func TestParseReaderOriginalInputIndexPreserved(t *testing.T) {
	input := `[
		{"start":1,"end":2,"speaker":"A","text":"one"},
		{"start":2,"end":3,"speaker":"B","text":"two"},
		{"start":3,"end":4,"speaker":"C","text":"three"}
	]`
	parsed, err := ParseReader(strings.NewReader(input))
	if err != nil {
		t.Fatalf("parse failed: %v", err)
	}
	for index, segment := range parsed.Segments {
		if segment.InputIndex != index {
			t.Fatalf("segment %d input index = %d, want %d", index, segment.InputIndex, index)
		}
	}
}

func assertContains(t *testing.T, err error, fragment string) {
	t.Helper()
	if err == nil {
		t.Fatalf("expected error containing %q", fragment)
	}
	if !strings.Contains(err.Error(), fragment) {
		t.Fatalf("error = %q, want substring %q", err.Error(), fragment)
	}
}