The following process is used to score different models on Orcasound test sets.
We use the AU-PRC metric - details are described in Methodology.
- Get the test data with `download_datasets.py` from the orcaml repo by specifying the `--only_test` flag. (For details about the test sets, see the Orcasound data wiki.)
- Run inference with your model and create a submission file following the Submission Format.
- Run `score.py` to get the results, as well as detailed Precision-Recall curves.
- You can score multiple submission files together to easily compare different models.
Your submission should contain a series of time intervals, with an associated confidence score, for each `wav_filename` in the test set.
The intervals can have any duration, but they should be non-overlapping and together cover the entire wav file. Any unmarked time intervals are assumed to have zero confidence.
Times do not need to be highly accurate: they are quantized to 1-second precision. NOTE: do not apply any thresholding; this is part of the scorer.
Specifically, we need a TSV file with these columns:
- `wav_filename` - name of the wav file
- `start_time_s` - start of the interval, relative to the audio file (in seconds)
- `duration_s` - duration of the interval (in seconds)
- `confidence` - confidence score, which is used to generate the AU-PRC metric and curve

For an example, see `submission/AudioSet-VGGish-R1to7.tsv`.
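As an illustration, here is a minimal sketch (standard library only) of writing predictions into this format. The prediction values, the output filename, and the presence of a header row are assumptions made for this example; mirror `submission/AudioSet-VGGish-R1to7.tsv` for the exact layout.

```python
import csv

# Hypothetical model output: for each wav file in the test set, a list of
# non-overlapping (start_time_s, duration_s, confidence) intervals that
# together cover the whole file.  Replace with your real inference results.
predictions = {
    "example_clip.wav": [
        (0.0, 2.45, 0.12),
        (2.45, 2.45, 0.87),
        (4.90, 2.45, 0.33),
    ],
}

# Write the four required columns as tab-separated values.
# NOTE: the header row is an assumption of this sketch.
with open("MyModel-R1to7.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["wav_filename", "start_time_s", "duration_s", "confidence"])
    for wav_filename, intervals in predictions.items():
        for start_time_s, duration_s, confidence in intervals:
            writer.writerow([wav_filename, start_time_s, duration_s, confidence])
```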
We quantize the intervals from both the ground truth and the submission file into 1-second time windows.
If the window covering seconds N to N+1 contains any part of an interval, window N is counted.
These quantized windows are then treated as individual examples for computing the AU-PRC evaluation metric.
The AU-PRC is computed individually for each sub-dataset, and a simple average is taken for the OVERALL score.
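To make the quantization concrete, here is a minimal sketch of the idea, not the actual `score.py` implementation; it assumes NumPy and scikit-learn, and the choices of how boundary seconds are rounded and how a window touched by two intervals is resolved (taking the maximum) are assumptions of this sketch.

```python
import math

import numpy as np
from sklearn.metrics import average_precision_score


def quantize(intervals, total_seconds):
    """Map (start_time_s, duration_s, value) intervals onto 1-second windows.

    Window N covers seconds [N, N+1); any window that overlaps part of an
    interval takes that interval's value, and unmarked windows stay at zero.
    """
    windows = np.zeros(total_seconds)
    for start_s, duration_s, value in intervals:
        first = int(math.floor(start_s))
        last = min(total_seconds, int(math.ceil(start_s + duration_s)))
        windows[first:last] = np.maximum(windows[first:last], value)
    return windows


# Example: a 10-second clip with one annotated call from 3.2 s to 5.0 s.
truth_intervals = [(3.2, 1.8, 1.0)]                        # ground truth -> label 1
submission_intervals = [(0.0, 5.0, 0.2), (5.0, 5.0, 0.9)]  # model confidences

labels = quantize(truth_intervals, total_seconds=10).astype(int)
scores = quantize(submission_intervals, total_seconds=10)

# Each 1-second window is one example; AU-PRC is computed per sub-dataset,
# and the OVERALL score is a simple average of the per-sub-dataset values.
print(average_precision_score(labels, scores))
```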
The following command runs scoring for the baseline and the current best model:
python score.py -testSetDir [DOWNLOAD_DATASETS] -submissionFiles "submission\Baseline-AudioSet-VGGish_R1to7.tsv,submission\FastAI-ResNet50_R1to7.tsv" -threshold (OPTIONAL)
A `results.md` file containing a summary, `au_pr_curves.png` containing plots, and `metrics.tsv` containing details are written to the directory containing the submission files.
If the optional `-threshold` argument is provided, precision/recall/F1 scores are also included in `metrics.tsv`.
(Note: this is only for development purposes; the official metric is the threshold-independent AU-PRC.)
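For intuition only: a threshold simply binarizes the per-window confidences before the standard classification metrics are computed. The sketch below uses scikit-learn with made-up values; the exact definitions used by `score.py` may differ.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# 1-second-window labels and confidences, as produced by the quantization
# step described above (values here are made up for the example).
labels = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.1, 0.2, 0.1, 0.8, 0.7, 0.4, 0.6, 0.1, 0.2, 0.1])

threshold = 0.5  # the value passed via the optional -threshold argument
predictions = (scores >= threshold).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    labels, predictions, average="binary"
)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```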
This is the current state of the art for models in the repo :)
dataset | Baseline-AudioSet-VGGish_R1to7 | FastAI-ResNet50_R1to7 | Baseline-AudioSet-VGGish_R1to12 | FastAI-ResNet50_R1to12 |
---|---|---|---|---|
OVERALL | 0.614 | 0.836 | 0.681 | 0.872 |
podcast_test_round1 | 0.949 | 0.979 | 0.939 | 0.977 |
podcast_test_round2 | 0.803 | 0.923 | 0.834 | 0.938 |
podcast_test_round3 | 0.09 | 0.605 | 0.269 | 0.700 |
NOTE: If you are deploying a new model: (1) generate your submission file and score it, comparing with the existing files; (2) update `/submission` and this README with your results; (3) upload your model (with a similar naming convention) to the folder and update the links in the README.
NOTE: The FastAI-ResNet50_R1to12 model was trained with extra false-positive data from the live system, in addition to the Round 1-12 training dataset.