Implemented substring matching for speakers.yml

2026-04-26 19:20:00 -05:00
parent fe00600762
commit 99d0c425d6
4 changed files with 308 additions and 74 deletions


@@ -74,22 +74,56 @@ Other WhisperX fields, including `words` and raw diarization speaker labels, are
 ## Speaker Map Format
-`speakers.yml` maps each input file basename to one canonical speaker name:
+`speakers.yml` maps input files to canonical speaker names using ordered substring rules:
 ```yaml
-inputs:
-  2026-04-19-Eric_Rakestraw.json:
-    speaker: "Eric Rakestraw"
+match:
+  - speaker: "Eric Rakestraw"
+    match:
+      - "Eric_Rakestraw"
+      - "Eric"
-  2026-04-19-Mike_Brown.json:
-    speaker: "Mike Brown"
+  - speaker: "Mike Brown"
+    match:
+      - "Mike_Brown"
+      - "mb"
 ```
+For each `--input-file`, `seriatim` takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.
+For example, this input:
+```text
+samples/raw/2026-04-19-Eric_Rakestraw.json
+```
+matches this rule because the basename contains `Eric_Rakestraw`:
+```yaml
+- speaker: "Eric Rakestraw"
+  match:
+    - "Eric_Rakestraw"
+```
 Important details:
-- Keys are matched against the basename of each `--input-file`, not the full path.
-- Every input file must have exactly one matching entry.
-- `speaker` is required and must be non-empty.
+- Matching is against the input file basename, not the full path.
+- Matching is case-insensitive.
+- Rules are evaluated from first to last.
+- Each rule must have a non-empty `speaker`.
+- Each rule must have at least one non-empty `match` string.
+- Duplicate speaker names are invalid.
+- Every input file must match at least one rule or the command fails.
+Deprecated old format:
+```yaml
+inputs:
+  eric.json:
+    speaker: "Eric Rakestraw"
+```
+The old `inputs:` direct mapping format is no longer supported.
 ## Output JSON Format
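Reviewer's sketch of the speaker-map rules documented in this hunk: the ordered, case-insensitive, first-match-wins lookup against the input file basename, plus the listed validity checks. This is not taken from the `seriatim` source. The function names `validate_rules` and `resolve_speaker` are hypothetical, and the rules are written as plain Python data rather than loaded from `speakers.yml`, so the example carries no assumptions about how the file is parsed.

```python
import os


def validate_rules(rules):
    """Check the documented constraints; raise ValueError on the first violation."""
    seen = set()
    for index, rule in enumerate(rules):
        speaker = str(rule.get("speaker") or "").strip()
        if not speaker:
            raise ValueError(f"rule {index}: speaker must be non-empty")
        if speaker in seen:
            raise ValueError(f"rule {index}: duplicate speaker {speaker!r}")
        seen.add(speaker)
        # At least one match string must be non-empty.
        patterns = [p for p in (rule.get("match") or []) if str(p).strip()]
        if not patterns:
            raise ValueError(f"rule {index}: needs at least one non-empty match string")


def resolve_speaker(input_path, rules):
    """Return the speaker of the first rule whose substring occurs in the basename."""
    # Only the basename is considered, never the directory part of the path.
    basename = os.path.basename(input_path).lower()
    for rule in rules:  # rules are evaluated from first to last
        for pattern in rule["match"]:
            # Skip empty strings, which would otherwise match every basename.
            if pattern and pattern.lower() in basename:
                return rule["speaker"]  # first match wins
    raise ValueError(f"no speaker rule matches {input_path!r}")


# Same rules as the YAML example in the hunk, expressed as Python data.
RULES = [
    {"speaker": "Eric Rakestraw", "match": ["Eric_Rakestraw", "Eric"]},
    {"speaker": "Mike Brown", "match": ["Mike_Brown", "mb"]},
]

validate_rules(RULES)
print(resolve_speaker("samples/raw/2026-04-19-Eric_Rakestraw.json", RULES))
# prints: Eric Rakestraw
```

Because the first matching rule wins, short substrings such as `mb` are safest in later rules: a basename that contains both `Eric` and `mb` resolves to the rule listed first.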