Implemented substring matching for speakers.yml
52
README.md
@@ -74,22 +74,56 @@ Other WhisperX fields, including `words` and raw diarization speaker labels, are
 
 ## Speaker Map Format
 
-`speakers.yml` maps each input file basename to one canonical speaker name:
+`speakers.yml` maps input files to canonical speaker names using ordered substring rules:
 
 ```yaml
-inputs:
-  2026-04-19-Eric_Rakestraw.json:
-    speaker: "Eric Rakestraw"
+match:
+  - speaker: "Eric Rakestraw"
+    match:
+      - "Eric_Rakestraw"
+      - "Eric"
 
-  2026-04-19-Mike_Brown.json:
-    speaker: "Mike Brown"
+  - speaker: "Mike Brown"
+    match:
+      - "Mike_Brown"
+      - "mb"
 ```
 
+For each `--input-file`, `seriatim` takes the file basename and evaluates the rules in order. The first rule with a matching substring wins, and no later rules are evaluated.
+
+For example, this input:
+
+```text
+samples/raw/2026-04-19-Eric_Rakestraw.json
+```
+
+matches this rule because the basename contains `Eric_Rakestraw`:
+
+```yaml
+- speaker: "Eric Rakestraw"
+  match:
+    - "Eric_Rakestraw"
+```
+
 Important details:
 
-- Keys are matched against the basename of each `--input-file`, not the full path.
-- Every input file must have exactly one matching entry.
-- `speaker` is required and must be non-empty.
+- Matching is against the input file basename, not the full path.
+- Matching is case-insensitive.
+- Rules are evaluated from first to last.
+- Each rule must have a non-empty `speaker`.
+- Each rule must have at least one non-empty `match` string.
+- Duplicate speaker names are invalid.
+- Every input file must match at least one rule or the command fails.
 
+Deprecated old format:
+
+```yaml
+inputs:
+  eric.json:
+    speaker: "Eric Rakestraw"
+```
+
+The old `inputs:` direct mapping format is no longer supported.
+
 ## Output JSON Format
 
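Note on the documented behavior: the rule evaluation this README change describes (take the basename, walk the rules in order, compare case-insensitively, stop at the first hit) can be sketched roughly as below. This is a minimal Python sketch and not `seriatim`'s actual implementation. `Rule`, `validate_rules`, and `resolve_speaker` are hypothetical names. Two details are assumptions: duplicate speaker names are compared exactly, and empty `match` strings are skipped rather than rejected.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """One entry of the rule list: a speaker plus its match substrings (hypothetical type)."""
    speaker: str
    match: Tuple[str, ...]


def validate_rules(rules: Sequence[Rule]) -> None:
    """Check the constraints listed under "Important details"."""
    seen = set()
    for rule in rules:
        if not rule.speaker.strip():
            raise ValueError("each rule must have a non-empty speaker")
        if not any(m.strip() for m in rule.match):
            raise ValueError(f"rule for {rule.speaker!r} needs a non-empty match string")
        # Assumption: duplicates are detected by exact string comparison.
        if rule.speaker in seen:
            raise ValueError(f"duplicate speaker name: {rule.speaker!r}")
        seen.add(rule.speaker)


def resolve_speaker(input_file: str, rules: Sequence[Rule]) -> str:
    """Return the speaker of the first rule whose substring occurs in the basename."""
    # Only the basename takes part in matching, never the directory part.
    basename = PurePath(input_file).name.casefold()
    for rule in rules:  # rules are evaluated from first to last
        for needle in rule.match:
            # Assumption: empty strings are ignored so they cannot match everything.
            if needle and needle.casefold() in basename:
                return rule.speaker  # first match wins; later rules are not evaluated
    raise LookupError(f"no rule matches input file {input_file!r}")
```

With the two rules from the README example, `resolve_speaker("samples/raw/2026-04-19-Eric_Rakestraw.json", rules)` returns `"Eric Rakestraw"`. Because the first match wins, a short substring such as `"mb"` in an early rule can shadow a more specific substring in a later one, so the order of rules matters.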