Architecture

Evaluator Hierarchy

The evaluator base classes live in pathbench.evaluator as abstract base classes (ABCs). The ABC a class inherits from determines exactly which inputs its score() method receives, so pick the right ABC before writing any evaluator logic.

| ABC | score() signature | Use for |
|---|---|---|
| LookupEvaluator | (utt_id) | Pre-computed scores |
| ReferenceFreeEvaluator | (utt_id, audio_path, start, end) | Audio-only metrics (CPP, SNR, …) |
| ReferenceTxtEvaluator | + transcription, language | ASR/FA-based metrics |
| ReferenceAudioEvaluator | + reference_audios | Reference comparison (NAD, ESTOI, …) |
| ReferenceTxtAndAudioEvaluator | + transcription, language, reference_audios | FA-trimmed reference metrics |
| ReferenceFreeSpeakerEvaluator | _score_audio_list(audios) | Speaker-level aggregation |
| LanguageAwareSpeakerEvaluator | _score_audio_list(audios, language) | Speaker-level + language (VSA) |
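As a rough illustration of how the choice of ABC fixes the score() inputs, the sketch below defines a stand-in ReferenceFreeEvaluator and a hypothetical subclass. The stand-in class, the DurationEvaluator name, and the type hints are assumptions made for this example; only the (utt_id, audio_path, start, end) argument list comes from the hierarchy above, and the real pathbench classes may differ.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ReferenceFreeEvaluator(ABC):
    """Stand-in for the pathbench ABC; only the score() arguments follow the docs."""

    @abstractmethod
    def score(self, utt_id: str, audio_path: str,
              start: Optional[float], end: Optional[float]) -> float:
        ...


class DurationEvaluator(ReferenceFreeEvaluator):
    """Hypothetical audio-only metric: the length of the requested segment."""

    def score(self, utt_id, audio_path, start, end):
        if start is None or end is None:
            raise ValueError(f"{utt_id}: this toy metric needs segment bounds")
        return end - start


evaluator = DurationEvaluator()
segment_length = evaluator.score("utt1", "audio/utt1.wav", 0.5, 2.0)
print(segment_length)
```

Because score() is abstract, instantiating the base class directly raises a TypeError, which is what forces each new evaluator to commit to one signature.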

FA-Trimming: Decorator Pattern

Forced-alignment (FA) silence trimming is never baked into the evaluators themselves. Instead, wrapper classes in evaluator.py apply it around an existing evaluator.

The trimmer is FATrimmer in pathbench/vad.py. If trimming fails or a segment offset is specified, the audio is loaded with a plain librosa.load() call instead.
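The wrapper idea and its fallback rule can be sketched as follows. This is a minimal sketch under stated assumptions, not the pathbench implementation: plain_load and fa_trim_load are toy stand-ins for librosa.load() and FATrimmer, and LengthEvaluator and TrimmedWrapper are hypothetical names invented for the example.

```python
from typing import Callable, List, Optional

Samples = List[float]


def plain_load(path: str, start: Optional[float] = None) -> Samples:
    """Stand-in for librosa.load(): returns the untrimmed signal."""
    return [0.0, 0.0, 0.3, 0.5, 0.0]


def fa_trim_load(path: str) -> Samples:
    """Stand-in for FATrimmer: drops surrounding silence, may raise."""
    if "broken" in path:
        raise RuntimeError("forced alignment failed")
    return [0.3, 0.5]


class LengthEvaluator:
    """Toy evaluator that knows nothing about trimming."""

    def score_samples(self, samples: Samples) -> float:
        return float(len(samples))


class TrimmedWrapper:
    """Hypothetical wrapper: decides which loader feeds the wrapped evaluator."""

    def __init__(self, inner: LengthEvaluator,
                 trim: Callable[[str], Samples] = fa_trim_load):
        self.inner = inner
        self.trim = trim

    def score(self, utt_id: str, audio_path: str,
              start: Optional[float] = None,
              end: Optional[float] = None) -> float:
        if start is not None:
            # A segment offset is given: skip trimming and load plainly.
            samples = plain_load(audio_path, start)
        else:
            try:
                samples = self.trim(audio_path)
            except Exception:
                # Trimming failed: fall back to the untrimmed audio.
                samples = plain_load(audio_path)
        return self.inner.score_samples(samples)


wrapper = TrimmedWrapper(LengthEvaluator())
trimmed_score = wrapper.score("utt1", "audio/utt1.wav")
failed_score = wrapper.score("utt2", "audio/broken.wav")
segment_score = wrapper.score("utt3", "audio/utt3.wav", start=1.0, end=2.0)
```

Keeping the loading decision in the wrapper means the wrapped evaluator can be used with or without trimming without any change to its own code.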

TrimmedNADEvaluator is an exception – it implements its own two-pass trimming logic directly, because the fallback must be group-consistent (all references fall back together).
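One way to read "group-consistent" is shown below: try to trim every file first, and if any single file fails, reload the whole group untrimmed. This is only an interpretation of the sentence above, not the actual TrimmedNADEvaluator code; load_group and the toy loaders are hypothetical names, and the loaders return labels instead of audio so the behaviour is easy to inspect.

```python
from typing import Callable, List, Sequence


def load_group(paths: Sequence[str],
               trim: Callable[[str], str],
               plain: Callable[[str], str]) -> List[str]:
    """Load a group of files so that all are trimmed or none are."""
    trimmed = []
    for path in paths:
        # Pass 1: attempt to trim every file in the group.
        try:
            trimmed.append(trim(path))
        except Exception:
            # Pass 2: one failure discards the trimmed results and
            # reloads the entire group without trimming.
            return [plain(p) for p in paths]
    return trimmed


def toy_trim(path: str) -> str:
    if path.endswith("bad.wav"):
        raise RuntimeError("forced alignment failed")
    return f"trimmed:{path}"


def toy_plain(path: str) -> str:
    return f"plain:{path}"


all_ok = load_group(["a.wav", "b.wav"], toy_trim, toy_plain)
one_bad = load_group(["a.wav", "bad.wav"], toy_trim, toy_plain)
```

A per-file fallback would mix trimmed and untrimmed references in one comparison, which is the inconsistency the group-level rule avoids.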

Dataset Format

Each dataset directory uses Kaldi-style plain text files:

  • wav.scp – utt_id -> audio_file_path

  • text – utt_id -> transcription

  • utt2spk – utt_id -> speaker_id

  • segments – utt_id -> recording_id start_time end_time (optional)

  • spk2score – speaker_id -> float (ground truth; N/A if unavailable)

  • spk2gender – speaker_id -> m|f

  • language – single line containing a language code (en, nl, it, es, cmn)
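The files listed above can be read with a few lines of Python. The sketch assumes the usual Kaldi layout of one entry per line, with the key separated from the value by whitespace (the "->" above describes the mapping and is not taken to be part of the file). read_kaldi_map and read_scores are hypothetical helpers written for this example, not pathbench functions.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional


def read_kaldi_map(path: Path) -> Dict[str, str]:
    """Read 'key value...' lines into a dict, skipping blank lines."""
    mapping = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.strip().split(maxsplit=1)
            if not parts:
                continue
            mapping[parts[0]] = parts[1] if len(parts) > 1 else ""
    return mapping


def read_scores(path: Path) -> Dict[str, Optional[float]]:
    """Read spk2score, mapping the N/A marker to None."""
    return {
        speaker: None if raw == "N/A" else float(raw)
        for speaker, raw in read_kaldi_map(path).items()
    }


# Build a tiny throwaway dataset directory to exercise the helpers.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "text").write_text(
        "utt1 the north wind\nutt2 and the sun\n", encoding="utf-8")
    (root / "spk2score").write_text("spkA 3.5\nspkB N/A\n", encoding="utf-8")

    transcripts = read_kaldi_map(root / "text")
    scores = read_scores(root / "spk2score")
```

Splitting only on the first run of whitespace keeps multi-word transcriptions and the three-field segments entries intact as a single value.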

Dataset loads these files and yields one tuple per utterance: (utt_id, audio_path, transcription, ref_audio_list, start_time, end_time). Reference audio is matched to an utterance by shared transcription text and, optionally, by speaker gender.
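The reference-matching rule can be sketched as a filter over a pool of candidate utterances. The function name find_references, the match_gender flag, and the idea of an explicit reference pool are assumptions made for this example; the source only states that matching uses shared transcription text and, optionally, gender.

```python
from typing import Dict, List, Sequence


def find_references(utt_id: str,
                    reference_pool: Sequence[str],
                    transcripts: Dict[str, str],
                    utt2spk: Dict[str, str],
                    spk2gender: Dict[str, str],
                    wav_scp: Dict[str, str],
                    match_gender: bool = False) -> List[str]:
    """Return audio paths of pool utterances with the same transcription."""
    target_text = transcripts[utt_id]
    target_gender = spk2gender.get(utt2spk[utt_id])
    references = []
    for ref_id in reference_pool:
        if ref_id == utt_id or transcripts[ref_id] != target_text:
            continue
        if match_gender and spk2gender.get(utt2spk[ref_id]) != target_gender:
            continue
        references.append(wav_scp[ref_id])
    return references


transcripts = {"p1": "the north wind", "c1": "the north wind",
               "c2": "the north wind", "c3": "and the sun"}
utt2spk = {"p1": "pat", "c1": "ctlF", "c2": "ctlM", "c3": "ctlF"}
spk2gender = {"pat": "f", "ctlF": "f", "ctlM": "m"}
wav_scp = {utt: f"audio/{utt}.wav" for utt in transcripts}
pool = ["c1", "c2", "c3"]

any_gender = find_references("p1", pool, transcripts, utt2spk,
                             spk2gender, wav_scp)
same_gender = find_references("p1", pool, transcripts, utt2spk,
                              spk2gender, wav_scp, match_gender=True)
```

In this toy data, c3 is excluded because its text differs, and c2 is additionally excluded once gender matching is switched on.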