# seriatim Architecture

`seriatim` is a deterministic transcript utility for:

- merging multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript,
- projecting existing seriatim transcript artifacts through deterministic segment-ID trimming, and
- canonicalizing external transcript-style JSON inputs into standard seriatim output schemas.

The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.

`seriatim` is implemented in Go.

## Goals

`seriatim` should:

1. Validate runtime configuration before performing transcript processing.
2. Support multiple input methods and formats through input readers.
3. Normalize raw per-speaker transcripts into a canonical internal model.
4. Apply deterministic preprocessing modules to canonical per-speaker transcripts.
5. Merge all segments into a deterministic global chronological order.
6. Apply deterministic postprocessing modules to the merged transcript.
7. Preserve word-level timing data when available.
8. Detect and annotate overlapping speech regions.
9. Emit one or more output artifacts through output writers.
10. Produce report data for validation findings, corrections, and transformations.
11. Support artifact-level transcript projection commands that operate on existing seriatim output.

## Non-goals

The 1.0 release does not attempt to:

- Perform transcription.
- Perform audio diarization.
- Use an LLM.
- Summarize transcript content.
- Infer speaker identity from audio or text.
- Fully resolve every crosstalk case.
- Load arbitrary third-party code as dynamic plugins.

The application supports runtime composition of built-in modules by canonical module name. Arbitrary external plugin loading can be considered later.

## Core Assumption

The merge algorithm assumes that all input transcript timestamps are measured against the same session clock.

This is expected when each speaker has a separate recording that preserves silence and starts at the same session recording time. If input files have independent local timelines, `seriatim` cannot safely merge them without a separate alignment step.

## Pipeline Overview

The internal pipeline is:

```text
configuration check
-> input
-> preprocessing
-> merge
-> postprocessing
-> output
```

Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.

`merge` runs this pipeline. `trim` and `normalize` are intentionally separate from this pipeline and operate at the artifact layer.

## Stage Contracts

### 1. Configuration Check

The configuration stage validates all CLI flags, environment variables, module names, input paths, output paths, and module-specific options before transcript data is processed.

Configuration validation should fail fast for:

- Missing required input.
- Unknown module names.
- Unknown input or output formats.
- Ambiguous speaker mappings.
- Invalid correction policies.
- Invalid timing thresholds.
- Invalid output paths.

The configuration stage produces an application config value that is passed through the pipeline.

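
The fail-fast behavior can be sketched as follows. `Config`, `knownModules`, and `ValidateConfig` are hypothetical names for illustration, not the actual `internal/config` API:

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical slice of the validated application config.
type Config struct {
	InputFiles []string
	Preprocess []string
}

// knownModules stands in for the canonical module-name registry.
var knownModules = map[string]bool{
	"validate-raw": true, "normalize-speakers": true, "trim-text": true,
}

// ValidateConfig fails fast before any transcript data is read.
func ValidateConfig(cfg Config) error {
	if len(cfg.InputFiles) == 0 {
		return errors.New("config: at least one --input-file is required")
	}
	for _, name := range cfg.Preprocess {
		if !knownModules[name] {
			return fmt.Errorf("config: unknown preprocessing module %q", name)
		}
	}
	return nil
}

func main() {
	// An unknown module name is a fatal configuration error.
	err := ValidateConfig(Config{InputFiles: []string{"eric.json"}, Preprocess: []string{"bogus"}})
	fmt.Println(err)
}
```

The key property is that every error here is reported before the input stage runs, so a bad module list never touches transcript data.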
### 2. Input Stage

The input stage converts external inputs into raw transcript documents with source metadata.

The current input method is one or more JSON files passed with repeated `--input-file` flags:

```text
seriatim merge --input-file eric.json --input-file mike.json --output-file merged.json
```

Future input methods may include:

- A `.tar.gz` bundle.
- A URI.
- A directory.

Future input formats may include:

- JSON.
- SRT.
- VTT.

Input readers should be selected from an explicit registry. A reader is responsible for loading external data and returning raw transcript documents, not for canonical normalization.

### 3. Preprocessing Stage

The preprocessing stage applies zero or more modules before global merge.

Preprocessing starts with raw transcript documents from input readers and must end with canonical per-speaker transcripts. Some preprocessing modules operate on raw transcripts, some perform raw-to-canonical normalization, and some operate only on canonical transcripts.

Preprocessing modules are selected at runtime with a comma-separated list of canonical module names:

```text
--preprocessing-modules validate-raw,normalize-speakers,trim-text
```

Modules run in the exact order provided. Unknown module names are configuration errors.

Potential preprocessing modules include:

- Structural raw transcript validation.
- Semantic transcript validation.
- Raw-to-canonical transcript normalization.
- Speaker name normalization based on input filename.
- Timing validation and deterministic correction.
- Text trimming.

Preprocessing should not depend on global chronological ordering across speakers. Modules that need the globally merged transcript belong in postprocessing.

Each preprocessing module must declare the model state it requires and the model state it produces. For example, `validate-raw` requires raw transcripts and produces raw transcripts, while `normalize-speakers` requires raw transcripts and produces canonical transcripts. Configuration validation should reject module orders that cannot type-check.

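
That type-check can be sketched as a simple state walk over the ordered module list. The `ModelState` values and the hard-coded module table below are assumptions for illustration:

```go
package main

import "fmt"

// ModelState is the transcript representation a module consumes or produces.
type ModelState int

const (
	StateRaw ModelState = iota
	StateCanonical
)

// moduleStates maps assumed module names to their (requires, produces) pair.
var moduleStates = map[string][2]ModelState{
	"validate-raw":       {StateRaw, StateRaw},
	"normalize-speakers": {StateRaw, StateCanonical},
	"trim-text":          {StateCanonical, StateCanonical},
}

// CheckOrder rejects module orders that cannot type-check: the chain starts
// at raw, each module's requirement must match the current state, and the
// chain must end canonical, so normalization happens exactly once.
func CheckOrder(names []string) error {
	state := StateRaw
	for _, n := range names {
		st, ok := moduleStates[n]
		if !ok {
			return fmt.Errorf("unknown module %q", n)
		}
		if st[0] != state {
			return fmt.Errorf("module %q requires state %d, pipeline is in state %d", n, st[0], state)
		}
		state = st[1]
	}
	if state != StateCanonical {
		return fmt.Errorf("preprocessing must end in canonical state")
	}
	return nil
}

func main() {
	fmt.Println(CheckOrder([]string{"validate-raw", "normalize-speakers", "trim-text"}))
}
```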
### 4. Merge Stage

The merge stage extracts all canonical segments from the preprocessed per-speaker transcripts and sorts them into a single deterministic chronological sequence.

The recommended sort key is:

```text
(start, end, source, source_segment_index, speaker)
```

The exact tie-breaker must be documented and stable across runs.

The merge stage should assign temporary internal references if needed, but it should not assign final output IDs until after all order-affecting postprocessing is complete.

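
The recommended key translates to a stable comparison chain. `Segment` and `MergeSort` are illustrative names, not the actual merge implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// Segment carries only the fields used by the recommended sort key.
type Segment struct {
	Start, End float64
	Source     string
	SourceIdx  int
	Speaker    string
}

// MergeSort orders segments by the documented key:
// (start, end, source, source_segment_index, speaker).
func MergeSort(segs []Segment) {
	sort.SliceStable(segs, func(i, j int) bool {
		a, b := segs[i], segs[j]
		switch {
		case a.Start != b.Start:
			return a.Start < b.Start
		case a.End != b.End:
			return a.End < b.End
		case a.Source != b.Source:
			return a.Source < b.Source
		case a.SourceIdx != b.SourceIdx:
			return a.SourceIdx < b.SourceIdx
		default:
			return a.Speaker < b.Speaker
		}
	})
}

func main() {
	segs := []Segment{
		{Start: 2.0, End: 3.0, Source: "mike.json", Speaker: "Mike"},
		{Start: 2.0, End: 3.0, Source: "eric.json", Speaker: "Eric"},
		{Start: 1.0, End: 2.5, Source: "mike.json", Speaker: "Mike"},
	}
	MergeSort(segs)
	// Exact start-time ties fall through to the source tie-breaker,
	// so output order is stable across runs.
	fmt.Println(segs[0].Speaker, segs[1].Speaker, segs[2].Speaker)
}
```

Because every field of the key is compared before falling back to the next, two runs over the same inputs always produce the same order.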
### 5. Postprocessing Stage

The postprocessing stage applies zero or more modules to the merged transcript.

Postprocessing modules are selected at runtime with a comma-separated list of canonical module names:

```text
--postprocessing-modules detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output
```

Modules run in the exact order provided. Unknown module names are configuration errors.

Potential postprocessing modules include:

- Overlap group detection.
- Overlap group refinement.
- Same-speaker segment coalescing.
- Deterministic grammar cleanup.
- Word replacement from `autocorrect.yml`.
- Final segment ID assignment.
- Output model validation.

Any module that can reorder, split, merge, drop, or create segments must run before final ID assignment.

### 6. Output Stage

The output stage emits one or more artifacts from the final transcript and report model.

The current output format is JSON, specified with:

```text
--output-file merged.json
```

The current named JSON schemas are:

- `seriatim-minimal`
- `seriatim-intermediate`
- `seriatim-full`

The current runtime default selection is `seriatim-intermediate`, but default selection may change over time. Consumers that depend on a specific schema should request it explicitly.

Future output formats may include:

- Markdown.
- SRT.
- VTT.
- Validation reports.
- Overlap reports.

Output writers should be selected from an explicit registry and should consume the final transcript model read-only. Multiple output writers may run for a single invocation.

### 7. Artifact Projection Stage (`trim` command)

`trim` is an artifact-level command that reads an existing seriatim output artifact and emits a projected artifact containing a segment-ID subset.

Design constraints:

- `trim` runs after `merge`, not as a merge postprocessor.
- `trim` validates the input artifact against supported seriatim output schemas.
- `trim` performs deterministic keep/remove selection by segment ID.
- `trim` renumbers retained IDs to `1..N` in transcript order.
- `trim` validates the final output against the selected output schema before writing.
- `trim` records audit metadata in report output.

`trim` is intentionally separate from merge postprocessing because it consumes already-emitted public artifacts. This separation keeps merge semantics stable and avoids rerunning merge-only transforms on projected artifacts.

`trim` must not rerun merge postprocessors such as `resolve-overlaps`, `coalesce`, or `autocorrect`.

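
The keep/renumber projection can be sketched over a simplified artifact struct. `ArtifactSegment` and `Trim` are assumptions for illustration, not the real `internal/trim` API:

```go
package main

import "fmt"

// ArtifactSegment is a minimal stand-in for a seriatim output segment.
type ArtifactSegment struct {
	ID   int
	Text string
}

// Trim keeps only the listed segment IDs and renumbers the survivors
// 1..N in transcript order, returning the old-to-new ID mapping that
// would be recorded as audit metadata in the report.
func Trim(segs []ArtifactSegment, keep map[int]bool) ([]ArtifactSegment, map[int]int) {
	var out []ArtifactSegment
	mapping := map[int]int{}
	for _, s := range segs {
		if !keep[s.ID] {
			continue
		}
		old := s.ID
		s.ID = len(out) + 1 // renumber sequentially in transcript order
		mapping[old] = s.ID
		out = append(out, s)
	}
	return out, mapping
}

func main() {
	segs := []ArtifactSegment{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}}
	out, m := Trim(segs, map[int]bool{2: true, 4: true})
	fmt.Println(len(out), out[0].ID, out[1].ID, m[4])
}
```

Because selection and renumbering depend only on the input artifact and the selector, the projection is deterministic by construction.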
### 8. Artifact Canonicalization Stage (`normalize` command)

`normalize` is an artifact-level command that reads transcript-like JSON and emits a standard seriatim output artifact in a selected schema.

Design constraints:

- `normalize` runs outside the merge pipeline and does not invoke merge preprocessing or postprocessing modules.
- `normalize` accepts two input shapes: object-with-`segments` and bare segment arrays.
- `normalize` validates required segment fields (`start`, `end`, `speaker`, `text`) and timing/speaker constraints.
- `normalize` sorts segments deterministically by chronological keys and stable input-index tie-breakers.
- `normalize` assigns fresh sequential output IDs (`1..N`) after sorting.
- `normalize` validates final output against the selected schema before writing.
- `normalize` writes optional deterministic report diagnostics when `--report-file` is requested.

`normalize` is intended for canonicalizing external transcript outputs (including Audita-style bare arrays) into seriatim contracts, not for running merge-time language or overlap transformations.

`normalize` must not run merge postprocessors such as overlap detection, overlap resolution, coalescing, or autocorrect.

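
Detection of the two accepted input shapes can be sketched with `encoding/json`. `NormSegment` and `ParseSegments` are hypothetical names for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NormSegment holds the required fields normalize validates.
type NormSegment struct {
	Start   float64 `json:"start"`
	End     float64 `json:"end"`
	Speaker string  `json:"speaker"`
	Text    string  `json:"text"`
}

// ParseSegments accepts both supported input shapes: an object with a
// "segments" key, or a bare array of segments. Detection is deterministic:
// the object shape is tried first, then the bare array.
func ParseSegments(data []byte) ([]NormSegment, error) {
	var wrapper struct {
		Segments []NormSegment `json:"segments"`
	}
	if err := json.Unmarshal(data, &wrapper); err == nil && wrapper.Segments != nil {
		return wrapper.Segments, nil
	}
	var bare []NormSegment
	if err := json.Unmarshal(data, &bare); err != nil {
		return nil, fmt.Errorf("normalize: unsupported input shape: %w", err)
	}
	return bare, nil
}

func main() {
	a, _ := ParseSegments([]byte(`{"segments":[{"start":0,"end":1,"speaker":"Eric","text":"hi"}]}`))
	b, _ := ParseSegments([]byte(`[{"start":0,"end":1,"speaker":"Mike","text":"yo"}]`))
	fmt.Println(len(a), len(b))
}
```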
## Module Classification

Modules should be classified by their contract and allowed effects.

| Class | Input | Output | Allowed effects |
| --- | --- | --- | --- |
| `InputReader` | External source spec | Raw transcript documents | Reads external data |
| `Validator` | Raw, canonical, merged, or final model | Same model plus report events | Observes only |
| `Normalizer` | Raw model | Canonical model | Converts representation |
| `Corrector` | Canonical model | Canonical model plus report events | Deterministic mutation |
| `Annotator` | Canonical or merged model | Same model plus annotations | Adds metadata |
| `Transformer` | Canonical or merged model | Updated model plus report events | May reorder, split, merge, drop, or create segments |
| `OutputWriter` | Final transcript and report | External artifact | Writes output |

This classification should guide Go interfaces and package boundaries. It should also determine where a module is allowed to run.

## Runtime Module Composition

The application supports runtime composition of built-in modules.

Module names are canonical strings registered at startup. CLI flags refer to those names. The configuration stage resolves names into module instances before the pipeline runs.

Example:

```text
seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --speakers speakers.yml \
  --autocorrect autocorrect.yml \
  --preprocessing-modules validate-raw,normalize-speakers,trim-text \
  --postprocessing-modules detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output \
  --output-modules json \
  --output-schema seriatim-intermediate \
  --output-file merged.json \
  --report-file report.json
```

Composition rules:

- Module order is exactly the order specified by the user.
- An empty module list is valid when the stage supports zero modules.
- Unknown module names are fatal configuration errors.
- Module-specific options are read from the validated application config.
- A module must declare which pipeline stage and model type it supports.
- Modules should be deterministic for the same inputs, config, and application version.
- Modules should not perform I/O unless their class explicitly allows it.

Some modules may be recommended defaults. Defaults should be explicit in documentation and should be equivalent to passing the corresponding module list.

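
Name resolution against the startup registry might look like this sketch; the `Postprocessor` interface is reduced to `Name()` and the registry contents are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// Postprocessor is the interface runtime composition resolves names into.
type Postprocessor interface{ Name() string }

type coalesce struct{}

func (coalesce) Name() string { return "coalesce" }

// registry maps canonical module names to instances, registered at startup.
var registry = map[string]Postprocessor{"coalesce": coalesce{}}

// Resolve turns a comma-separated CLI flag value into ordered module
// instances. Unknown names are fatal configuration errors, and order is
// preserved exactly as the user specified it.
func Resolve(flag string) ([]Postprocessor, error) {
	var mods []Postprocessor
	for _, name := range strings.Split(flag, ",") {
		m, ok := registry[name]
		if !ok {
			return nil, fmt.Errorf("unknown postprocessing module %q", name)
		}
		mods = append(mods, m)
	}
	return mods, nil
}

func main() {
	mods, err := Resolve("coalesce")
	fmt.Println(len(mods), err)
}
```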
## Go Interface Sketch

The exact implementation may evolve, but the core interfaces should resemble:

```go
type InputReader interface {
	Name() string
	Read(ctx context.Context, spec InputSpec, cfg Config) ([]RawTranscript, []ReportEvent, error)
}

type Preprocessor interface {
	Name() string
	Requires() ModelState
	Produces() ModelState
	Process(ctx context.Context, in PreprocessState, cfg Config) (PreprocessState, []ReportEvent, error)
}

type Merger interface {
	Merge(ctx context.Context, in []CanonicalTranscript, cfg Config) (MergedTranscript, []ReportEvent, error)
}

type Postprocessor interface {
	Name() string
	Process(ctx context.Context, in MergedTranscript, cfg Config) (MergedTranscript, []ReportEvent, error)
}

type OutputWriter interface {
	Name() string
	Write(ctx context.Context, out any, report Report, cfg Config) ([]ReportEvent, error)
}
```

`PreprocessState` should carry either raw transcripts, canonical transcripts, or both during migration between representations. The pipeline should validate that the ordered preprocessing list transitions from raw input state to canonical output state exactly once before merge.

The interfaces should favor value returns over hidden mutation. If pointer-based implementations are chosen for performance, mutation boundaries must still be clear and tested.

## Canonical Internal Model

The canonical model should be richer than the final output schema.

Canonical segment fields should include:

- Temporary internal reference.
- Source identifier.
- Source segment index.
- Canonical speaker.
- Start time.
- End time.
- Text.
- Word-level timing data, if available.
- Raw diarization labels, if useful for reporting.
- Validation and correction metadata, if needed internally.

The final output model can omit internal-only fields, but the report should retain enough provenance to diagnose corrections and transformations.

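
As a sketch, the canonical segment could be a struct like the following. Field names are illustrative, not the actual `internal/model` definitions:

```go
package main

import "fmt"

// Word carries word-level timing preserved through the pipeline.
type Word struct {
	Start, End float64
	Text       string
	Speaker    string
}

// CanonicalSegment sketches the canonical internal model, which is
// intentionally richer than any of the public output schemas.
type CanonicalSegment struct {
	Ref        string  // temporary internal reference, not a final output ID
	Source     string  // source identifier, e.g. the input filename
	SourceIdx  int     // segment index within the source transcript
	Speaker    string  // canonical speaker after normalization
	Start, End float64 // session-clock times in seconds
	Text       string
	Words      []Word   // word-level timing data, if available
	RawLabels  []string // raw diarization labels, kept for reporting
}

func main() {
	s := CanonicalSegment{
		Ref:     "eric.json#0",
		Source:  "eric.json",
		Speaker: "Eric",
		Start:   1.5, End: 3.0,
		Text: "Roll for initiative.",
	}
	fmt.Println(s.Ref, s.Speaker)
}
```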
## Validation Strategy

Validation occurs at multiple boundaries:

- Configuration validation before processing.
- Raw input structural validation after input loading.
- Raw input semantic validation before normalization or correction.
- Canonical model validation after normalization and preprocessing.
- Merged model validation after merge and postprocessing.
- Final output schema validation before writing artifacts.

Structural validation answers whether data has the required shape and types.

Semantic validation answers whether the data is plausible and internally consistent.

Correctable issues should be deterministic and reportable. Fatal issues should stop the run with a non-zero exit code.

Examples of correctable issues:

- Leading or trailing whitespace.
- Segment `end < start`, when configured correction policy allows deterministic repair.
- Missing word speaker labels when canonical speaker is known.
- Raw diarization labels that should be replaced with the canonical speaker.

Examples of fatal issues:

- Input file is not valid JSON.
- Required transcript fields are missing.
- Speaker map does not identify a canonical speaker for an input.
- Unknown module name.
- Output fails final schema validation.

## Overlap Handling

Overlap detection should create overlap groups rather than only pairwise annotations.

Two adjacent sorted segments overlap when:

```text
next.start < current_group_end
```

This supports transitive overlap groups:

```text
A: 10.0-14.0
B: 12.0-13.0
C: 13.5-15.0
```

These belong to one overlap group spanning `10.0-15.0`.

Overlap groups should record:

- Overlap group ID.
- Group start time.
- Group end time.
- Segment references.
- Speakers involved.
- Classification, if known.
- Resolution status.

Initial classifications may include:

- `unknown`
- `minor_overlap`
- `handoff`
- `backchannel`
- `crosstalk`

The `resolve-overlaps` module uses preserved word-level timing to replace detected overlap-group segments with smaller word-run segments when usable timing is available. Resolution expands each overlap window by the configured coalesce gap so nearby same-speaker context can be absorbed into the replacement runs. Once a segment is selected for replacement, all timed words from that segment participate in word-run construction so text is not clipped at the window boundary. Groups without usable word timing remain unresolved for later passes or human review.

Overlap resolution should be non-destructive. Original segment text, timing, and source metadata must remain recoverable.

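
The grouping rule can be sketched as a single pass over start-sorted segments, extending the current group whenever the next segment starts before the group's running end. `Seg` and `OverlapGroups` are illustrative names:

```go
package main

import "fmt"

// Seg is a minimal segment for overlap grouping.
type Seg struct {
	Start, End float64
	Speaker    string
}

// OverlapGroups walks segments already sorted by start time and collects
// transitively overlapping runs: a segment joins the current group when
// next.start < current_group_end. Only runs with two or more segments
// become overlap groups.
func OverlapGroups(segs []Seg) [][]Seg {
	var groups [][]Seg
	var cur []Seg
	groupEnd := 0.0
	for _, s := range segs {
		if len(cur) > 0 && s.Start < groupEnd {
			cur = append(cur, s)
		} else {
			if len(cur) > 1 {
				groups = append(groups, cur)
			}
			cur = []Seg{s}
			groupEnd = 0
		}
		if s.End > groupEnd {
			groupEnd = s.End
		}
	}
	if len(cur) > 1 {
		groups = append(groups, cur)
	}
	return groups
}

func main() {
	// The A/B/C example from above, plus a later non-overlapping segment.
	segs := []Seg{{10.0, 14.0, "A"}, {12.0, 13.0, "B"}, {13.5, 15.0, "C"}, {20.0, 21.0, "A"}}
	groups := OverlapGroups(segs)
	fmt.Println(len(groups), len(groups[0]))
}
```

Note that C overlaps the group (its start 13.5 precedes the running end 14.0) even though it does not overlap B directly, which is exactly the transitive behavior described above.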
## Final ID Assignment

Final segment IDs should be assigned by an explicit postprocessing module after every transformation that can affect segment order.

Final IDs should be sequential integers starting from `1`.

Final IDs should reflect final chronological order.

Before final ID assignment, modules should reference segments using stable internal references rather than final output IDs.

## Output Invariants

A valid merged transcript should satisfy:

- Every segment has a unique integer ID.
- Segment IDs begin at `1`.
- Segment IDs increase in final chronological order.
- Every segment has a canonical speaker.
- Every segment has a source.
- Every segment has `start >= 0`.
- Every segment has `end >= start`.
- The segments array is sorted deterministically.
- Any `overlap_group_id` on a segment refers to an existing overlap group.
- Every overlap group references at least two segments.
- Every referenced segment exists.
- Output validates against the selected output schema.

For full-schema trim output, overlap groups are recomputed from retained segments so overlap annotations and group references remain internally consistent after projection.

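
Several of these invariants can be checked mechanically. This sketch assumes a simplified `OutSegment` struct and covers only the ID, speaker/source, and timing invariants:

```go
package main

import "fmt"

// OutSegment is a simplified output segment for invariant checking.
type OutSegment struct {
	ID         int
	Speaker    string
	Source     string
	Start, End float64
}

// CheckInvariants verifies a subset of the documented output invariants:
// IDs are exactly 1..N in order, speaker and source are present, and
// timing satisfies start >= 0 and end >= start.
func CheckInvariants(segs []OutSegment) error {
	for i, s := range segs {
		if s.ID != i+1 {
			return fmt.Errorf("segment at index %d: ID %d, want %d", i, s.ID, i+1)
		}
		if s.Speaker == "" || s.Source == "" {
			return fmt.Errorf("segment %d: missing speaker or source", s.ID)
		}
		if s.Start < 0 || s.End < s.Start {
			return fmt.Errorf("segment %d: invalid timing [%v, %v]", s.ID, s.Start, s.End)
		}
	}
	return nil
}

func main() {
	segs := []OutSegment{
		{1, "Eric", "eric.json", 0, 1.5},
		{2, "Mike", "mike.json", 1.2, 2.0},
	}
	fmt.Println(CheckInvariants(segs))
}
```

Checking IDs against the index enforces uniqueness, the `1` start, and monotonic ordering in one comparison.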
## Determinism Requirements

Given the same inputs, config, and application version, `seriatim` should produce byte-stable JSON output where practical.

To support this:

- Sort input specs deterministically unless explicit input order is meaningful.
- Use stable sort keys.
- Assign final IDs only after final ordering.
- Avoid Go map iteration order affecting output.
- Emit JSON through structs with stable field ordering.
- Record application version in output metadata.
- Record enabled module names and module order in output metadata or report data.

Trim-specific determinism requirements:

- Selector normalization and retained IDs are deterministic.
- Old-to-new ID mapping in trim reports is emitted in deterministic order.
- Full-schema overlap recomputation is deterministic for the same input artifact and selector.

Normalize-specific determinism requirements:

- Input-shape detection is deterministic.
- Segment ordering is deterministic for identical input data.
- Output IDs are always reassigned sequentially after deterministic sorting.
- Normalize diagnostic reports are deterministic for identical inputs and configuration.

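
For example, emitting a map deterministically requires sorting its keys first, since Go randomizes map iteration order. `EmitMapping` is a hypothetical helper in the shape of a trim report's old-to-new ID mapping:

```go
package main

import (
	"fmt"
	"sort"
)

// EmitMapping renders an old-to-new ID mapping in deterministic order by
// sorting keys before iteration; ranging over the map directly would make
// report output differ between runs.
func EmitMapping(m map[int]int) []string {
	keys := make([]int, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, fmt.Sprintf("%d->%d", k, m[k]))
	}
	return out
}

func main() {
	fmt.Println(EmitMapping(map[int]int{7: 2, 3: 1, 9: 3}))
}
```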
## Go Package Layout

```text
cmd/seriatim/          CLI entrypoint
internal/config/       CLI/env/config loading and validation
internal/pipeline/     Pipeline orchestration and module registry
internal/builtin/      Built-in pipeline modules
internal/artifact/     Conversion from internal model to public output schema
internal/normalize/    Normalize input parsing, validation, deterministic sorting, schema conversion, and diagnostics
internal/trim/         Artifact parsing, trim selection, schema conversion, overlap recomputation for full schema
internal/buildinfo/    Build-time version metadata
internal/speaker/      Speaker map parsing and lookup
internal/model/        Canonical and merged transcript models
internal/overlap/      Overlap detection and refinement helpers
internal/autocorrect/  Word replacement rules
internal/report/       Report model and event accumulation
schema/                Public output contract and JSON Schema validation
```

Package boundaries should follow data ownership. Shared models belong in `internal/model`; stage-specific behavior belongs in the relevant stage package.

For trim:

- `internal/trim` contains pure transformation logic over artifact structs.
- CLI command code handles only flag parsing, file I/O, and report emission.
- Transform logic is deterministic and pure except for command-layer I/O.

For normalize:

- `internal/normalize` contains parsing/validation and deterministic schema conversion logic.
- CLI command code handles flag parsing and delegates execution.
- Normalize remains artifact-level and does not compose merge pipeline modules.

## Default Modules

The default pipeline is equivalent to explicit module lists.

Recommended default preprocessing modules:

```text
validate-raw,normalize-speakers,trim-text
```

Recommended default postprocessing modules:

```text
detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output
```

The default output module is:

```text
json
```