# seriatim Architecture

`seriatim` is a deterministic transcript merge utility for combining multiple per-speaker transcript inputs into a single chronologically ordered diarized transcript. The initial use case is merging independently transcribed speaker audio tracks from the same recorded session, such as a weekly tabletop RPG session. The architecture should also support meetings, podcasts, interviews, and other multi-speaker events.

`seriatim` is implemented in Go.

## Goals

`seriatim` should:

1. Validate runtime configuration before performing transcript processing.
2. Support multiple input methods and formats through input readers.
3. Normalize raw per-speaker transcripts into a canonical internal model.
4. Apply deterministic preprocessing modules to canonical per-speaker transcripts.
5. Merge all segments into a deterministic global chronological order.
6. Apply deterministic postprocessing modules to the merged transcript.
7. Preserve word-level timing data when available.
8. Detect and annotate overlapping speech regions.
9. Emit one or more output artifacts through output writers.
10. Produce report data for validation findings, corrections, and transformations.

## Non-goals

The 1.0 release does not attempt to:

- Perform transcription.
- Perform audio diarization.
- Use an LLM.
- Summarize transcript content.
- Infer speaker identity from audio or text.
- Fully resolve every crosstalk case.
- Load arbitrary third-party code as dynamic plugins.

The application supports runtime composition of built-in modules by canonical module name. Arbitrary external plugin loading can be considered later.

## Core Assumption

The merge algorithm assumes that all input transcript timestamps are measured against the same session clock. This is expected when each speaker has a separate recording that preserves silence and starts at the same session recording time.
If input files have independent local timelines, `seriatim` cannot safely merge them without a separate alignment step.

## Pipeline Overview

The internal pipeline is:

```text
configuration check -> input -> preprocessing -> merge -> postprocessing -> output
```

Each stage has an explicit data contract. Input and output stages perform I/O. Processing stages should be deterministic transformations over in-memory models and should record report events for validation findings, corrections, and transformations.

## Stage Contracts

### 1. Configuration Check

The configuration stage validates all CLI flags, environment variables, module names, input paths, output paths, and module-specific options before transcript data is processed.

Configuration validation should fail fast for:

- Missing required input.
- Unknown module names.
- Unknown input or output formats.
- Ambiguous speaker mappings.
- Invalid correction policies.
- Invalid timing thresholds.
- Invalid output paths.

The configuration stage produces an application config value that is passed through the pipeline.

### 2. Input Stage

The input stage converts external inputs into raw transcript documents with source metadata.

The current input method is one or more JSON files passed with repeated `--input-file` flags:

```text
seriatim merge --input-file eric.json --input-file mike.json --output-file merged.json
```

Future input methods may include:

- A `.tar.gz` bundle.
- A URI.
- A directory.

Future input formats may include:

- JSON.
- SRT.
- VTT.

Input readers should be selected from an explicit registry. A reader is responsible for loading external data and returning raw transcript documents, not for canonical normalization.

### 3. Preprocessing Stage

The preprocessing stage applies zero or more modules before global merge. Preprocessing starts with raw transcript documents from input readers and must end with canonical per-speaker transcripts.
Some preprocessing modules operate on raw transcripts, some perform raw-to-canonical normalization, and some operate only on canonical transcripts.

Preprocessing modules are selected at runtime with a comma-separated list of canonical module names:

```text
--preprocessing-modules validate-raw,normalize-speakers,trim-text
```

Modules run in the exact order provided. Unknown module names are configuration errors.

Potential preprocessing modules include:

- Structural raw transcript validation.
- Semantic transcript validation.
- Raw-to-canonical transcript normalization.
- Speaker name normalization based on input filename.
- Timing validation and deterministic correction.
- Text trimming.

Preprocessing should not depend on global chronological ordering across speakers. Modules that need the globally merged transcript belong in postprocessing.

Each preprocessing module must declare the model state it requires and the model state it produces. For example, `validate-raw` requires raw transcripts and produces raw transcripts, while `normalize-speakers` requires raw transcripts and produces canonical transcripts. Configuration validation should reject module orders that cannot type-check.

### 4. Merge Stage

The merge stage extracts all canonical segments from the preprocessed per-speaker transcripts and sorts them into a single deterministic chronological sequence.

The recommended sort key is:

```text
(start, end, source, source_segment_index, speaker)
```

The exact tie-breaker must be documented and stable across runs.

The merge stage should assign temporary internal references if needed, but it should not assign final output IDs until after all order-affecting postprocessing is complete.

### 5. Postprocessing Stage

The postprocessing stage applies zero or more modules to the merged transcript.
Postprocessing modules are selected at runtime with a comma-separated list of canonical module names:

```text
--postprocessing-modules detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output
```

Modules run in the exact order provided. Unknown module names are configuration errors.

Potential postprocessing modules include:

- Overlap group detection.
- Overlap group refinement.
- Same-speaker segment coalescing.
- Deterministic grammar cleanup.
- Word replacement from `autocorrect.yml`.
- Final segment ID assignment.
- Output model validation.

Any module that can reorder, split, merge, drop, or create segments must run before final ID assignment.

### 6. Output Stage

The output stage emits one or more artifacts from the final transcript and report model.

The current output format is JSON, specified with:

```text
--output-file merged.json
```

Future output formats may include:

- Markdown.
- SRT.
- VTT.
- Validation reports.
- Overlap reports.

Output writers should be selected from an explicit registry and should consume the final transcript model read-only. Multiple output writers may run for a single invocation.

## Module Classification

Modules should be classified by their contract and allowed effects.
| Class | Input | Output | Allowed effects |
| --- | --- | --- | --- |
| `InputReader` | External source spec | Raw transcript documents | Reads external data |
| `Validator` | Raw, canonical, merged, or final model | Same model plus report events | Observes only |
| `Normalizer` | Raw model | Canonical model | Converts representation |
| `Corrector` | Canonical model | Canonical model plus report events | Deterministic mutation |
| `Annotator` | Canonical or merged model | Same model plus annotations | Adds metadata |
| `Transformer` | Canonical or merged model | Updated model plus report events | May reorder, split, merge, drop, or create segments |
| `OutputWriter` | Final transcript and report | External artifact | Writes output |

This classification should guide Go interfaces and package boundaries. It should also determine where a module is allowed to run.

## Runtime Module Composition

The application supports runtime composition of built-in modules. Module names are canonical strings registered at startup. CLI flags refer to those names. The configuration stage resolves names into module instances before the pipeline runs.

Example:

```text
seriatim merge \
  --input-file eric.json \
  --input-file mike.json \
  --speakers speakers.yml \
  --autocorrect autocorrect.yml \
  --preprocessing-modules validate-raw,normalize-speakers,trim-text \
  --postprocessing-modules detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output \
  --output-modules json \
  --output-schema seriatim \
  --output-file merged.json \
  --report-file report.json
```

Composition rules:

- Module order is exactly the order specified by the user.
- An empty module list is valid when the stage supports zero modules.
- Unknown module names are fatal configuration errors.
- Module-specific options are read from the validated application config.
- A module must declare which pipeline stage and model type it supports.
- Modules should be deterministic for the same inputs, config, and application version.
- Modules should not perform I/O unless their class explicitly allows it.

Some modules may be recommended defaults. Defaults should be explicit in documentation and should be equivalent to passing the corresponding module list.

## Go Interface Sketch

The exact implementation may evolve, but the core interfaces should resemble:

```go
type InputReader interface {
	Name() string
	Read(ctx context.Context, spec InputSpec, cfg Config) ([]RawTranscript, []ReportEvent, error)
}

type Preprocessor interface {
	Name() string
	Requires() ModelState
	Produces() ModelState
	Process(ctx context.Context, in PreprocessState, cfg Config) (PreprocessState, []ReportEvent, error)
}

type Merger interface {
	Merge(ctx context.Context, in []CanonicalTranscript, cfg Config) (MergedTranscript, []ReportEvent, error)
}

type Postprocessor interface {
	Name() string
	Process(ctx context.Context, in MergedTranscript, cfg Config) (MergedTranscript, []ReportEvent, error)
}

type OutputWriter interface {
	Name() string
	Write(ctx context.Context, out any, report Report, cfg Config) ([]ReportEvent, error)
}
```

`PreprocessState` should carry either raw transcripts, canonical transcripts, or both during migration between representations. The pipeline should validate that the ordered preprocessing list transitions from raw input state to canonical output state exactly once before merge.

The interfaces should favor value returns over hidden mutation. If pointer-based implementations are chosen for performance, mutation boundaries must still be clear and tested.

## Canonical Internal Model

The canonical model should be richer than the final output schema. Canonical segment fields should include:

- Temporary internal reference.
- Source identifier.
- Source segment index.
- Canonical speaker.
- Start time.
- End time.
- Text.
- Word-level timing data, if available.
- Raw diarization labels, if useful for reporting.
- Validation and correction metadata, if needed internally.

The final output model can omit internal-only fields, but the report should retain enough provenance to diagnose corrections and transformations.

## Validation Strategy

Validation occurs at multiple boundaries:

- Configuration validation before processing.
- Raw input structural validation after input loading.
- Raw input semantic validation before normalization or correction.
- Canonical model validation after normalization and preprocessing.
- Merged model validation after merge and postprocessing.
- Final output schema validation before writing artifacts.

Structural validation answers whether data has the required shape and types. Semantic validation answers whether the data is plausible and internally consistent.

Correctable issues should be deterministic and reportable. Fatal issues should stop the run with a non-zero exit code.

Examples of correctable issues:

- Leading or trailing whitespace.
- Segment `end < start`, when configured correction policy allows deterministic repair.
- Missing word speaker labels when canonical speaker is known.
- Raw diarization labels that should be replaced with the canonical speaker.

Examples of fatal issues:

- Input file is not valid JSON.
- Required transcript fields are missing.
- Speaker map does not identify a canonical speaker for an input.
- Unknown module name.
- Output fails final schema validation.

## Overlap Handling

Overlap detection should create overlap groups rather than only pairwise annotations.

Two adjacent sorted segments overlap when:

```text
next.start < current_group_end
```

This supports transitive overlap groups:

```text
A: 10.0-14.0
B: 12.0-13.0
C: 13.5-15.0
```

These belong to one overlap group spanning `10.0-15.0`.

Overlap groups should record:

- Overlap group ID.
- Group start time.
- Group end time.
- Segment references.
- Speakers involved.
- Classification, if known.
- Resolution status.
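The grouping rule above can be sketched as a single pass over segments already sorted by start time. This is a minimal illustration, not the canonical model: the `Segment` type and function names here are stand-ins for illustration only.

```go
package main

import "fmt"

// Segment is a minimal stand-in for the canonical segment model.
type Segment struct {
	Speaker    string
	Start, End float64
}

// groupOverlaps walks segments sorted by start time and builds
// transitive overlap groups: a segment joins the current group
// when it starts before the group's running end time.
func groupOverlaps(segs []Segment) [][]Segment {
	var groups [][]Segment
	var current []Segment
	groupEnd := 0.0
	for _, s := range segs {
		if len(current) > 0 && s.Start < groupEnd {
			current = append(current, s)
		} else {
			// Flush the previous group; a group needs at least two segments.
			if len(current) > 1 {
				groups = append(groups, current)
			}
			current = []Segment{s}
			groupEnd = 0
		}
		if s.End > groupEnd {
			groupEnd = s.End
		}
	}
	if len(current) > 1 {
		groups = append(groups, current)
	}
	return groups
}

// maxEnd returns the latest end time in a group.
func maxEnd(g []Segment) float64 {
	e := 0.0
	for _, s := range g {
		if s.End > e {
			e = s.End
		}
	}
	return e
}

func main() {
	segs := []Segment{
		{"A", 10.0, 14.0},
		{"B", 12.0, 13.0},
		{"C", 13.5, 15.0},
	}
	for _, g := range groupOverlaps(segs) {
		// prints: group 10.0-15.0, 3 segments
		fmt.Printf("group %.1f-%.1f, %d segments\n", g[0].Start, maxEnd(g), len(g))
	}
}
```

Note that `C` does not overlap `B` directly, but the running group end (`14.0` from `A`) pulls it into the same group, which is exactly the transitive behavior the rule is designed to produce.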
Initial classifications may include:

- `unknown`
- `minor_overlap`
- `handoff`
- `backchannel`
- `crosstalk`

The `resolve-overlaps` module uses preserved word-level timing to replace detected overlap-group segments with smaller word-run segments when usable timing is available. Resolution expands each overlap window by the configured coalesce gap so nearby same-speaker context can be absorbed into the replacement runs. Groups without usable word timing remain unresolved for later passes or human review.

Overlap resolution should be non-destructive. Original segment text, timing, and source metadata must remain recoverable.

## Final ID Assignment

Final segment IDs should be assigned by an explicit postprocessing module after every transformation that can affect segment order.

Final IDs should be sequential integers starting from `1`. Final IDs should reflect final chronological order.

Before final ID assignment, modules should reference segments using stable internal references rather than final output IDs.

## Output Invariants

A valid merged transcript should satisfy:

- Every segment has a unique integer ID.
- Segment IDs begin at `1`.
- Segment IDs increase in final chronological order.
- Every segment has a canonical speaker.
- Every segment has a source.
- Every segment has `start >= 0`.
- Every segment has `end >= start`.
- The segments array is sorted deterministically.
- Any `overlap_group_id` on a segment refers to an existing overlap group.
- Every overlap group references at least two segments.
- Every referenced segment exists.
- Output validates against the selected output schema.

## Determinism Requirements

Given the same inputs, config, and application version, `seriatim` should produce byte-stable JSON output where practical. To support this:

- Sort input specs deterministically unless explicit input order is meaningful.
- Use stable sort keys.
- Assign final IDs only after final ordering.
- Avoid Go map iteration order affecting output.
- Emit JSON through structs with stable field ordering.
- Record application version in output metadata.
- Record enabled module names and module order in output metadata or report data.

## Go Package Layout

```text
cmd/seriatim/          CLI entrypoint
internal/config/       CLI/env/config loading and validation
internal/pipeline/     Pipeline orchestration and module registry
internal/builtin/      Built-in pipeline modules
internal/artifact/     Conversion from internal model to public output schema
internal/buildinfo/    Build-time version metadata
internal/speaker/      Speaker map parsing and lookup
internal/model/        Canonical and merged transcript models
internal/overlap/      Overlap detection and refinement helpers
internal/autocorrect/  Word replacement rules
internal/report/       Report model and event accumulation
schema/                Public output contract and JSON Schema validation
```

Package boundaries should follow data ownership. Shared models belong in `internal/model`; stage-specific behavior belongs in the relevant stage package.

## Default Modules

The default pipeline is equivalent to explicit module lists.

Recommended default preprocessing modules:

```text
validate-raw,normalize-speakers,trim-text
```

Recommended default postprocessing modules:

```text
detect-overlaps,resolve-overlaps,backchannel,filler,resolve-danglers,coalesce,detect-overlaps,autocorrect,assign-ids,validate-output
```

The default output module is:

```text
json
```