MRG: #216 from vocalpy/add-revise-vignettes
Add / revise vignettes
NickleDave authored Jan 28, 2023
2 parents 9d435f6 + bdc578b commit f36a493
Showing 14 changed files with 1,506 additions and 124 deletions.
20 changes: 17 additions & 3 deletions doc/conf.py
@@ -62,8 +62,7 @@
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
source_suffix = ['.rst', '.md']

# The master toctree document.
master_doc = 'index'
@@ -84,7 +83,7 @@
pygments_style = None

myst_enable_extensions = [
# "dollarmath",
"dollarmath",
# "amsmath",
# "deflist",
# "html_admonition",
@@ -111,6 +110,7 @@
#
html_theme_options = {
"logo_only": True,
"show_toc_level": 1,
}

# Add any paths that contain custom static files (such as style sheets) here,
@@ -217,6 +217,20 @@
"pandera": ("https://pandera.readthedocs.io/en/stable/", None)
}

# -- Options for nitpicky mode

# ensure that all references in the docs resolve.
nitpicky = True
nitpick_ignore = []

for line in open('nitpick-ignore.txt'):
    if line.strip() == "" or line.startswith("#"):
        continue
    dtype, target = line.split(None, 1)
    target = target.strip()
    nitpick_ignore.append((dtype, target))


# -- Options for todo extension ----------------------------------------------

# If true, `todo` and `todoList` produce output, else they produce nothing.
Empty file.
2 changes: 2 additions & 0 deletions doc/howto.md
@@ -9,4 +9,6 @@ This section shows you how to use crowsetta for specific tasks.
howto/howto-user-format
howto/convert-generic-seq
howto/convert-simple-seq
howto/remove-silent-labels-textgrid
```
180 changes: 124 additions & 56 deletions doc/howto/convert-generic-seq.md
@@ -4,26 +4,48 @@ jupytext:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.13.8
jupytext_version: 1.14.4
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
execution:
timeout: 120
---

(howto-convert-to-generic-seq)=
# How to convert any sequence-like format to `'generic-seq'`

The `'generic-seq'` format is
meant to be a generic sequence-like format
(as suggested by its name)
that all other formats can be converted to.
As explained on its
{ref}`documentation <generic-seq>` page,
a set of `generic-seq` annotations is
literally a set of `crowsetta.Annotation` instances
where each `Annotation` has a `Sequence`.

A goal of crowsetta is to make it easier to share annotations
for a dataset of animal vocalizations or other bioacoustics data.
One way to achieve this is to
convert the annotations to a single flat csv file,
which is easy to share and work with,
e.g., using the [pandas](https://pandas.pydata.org/) library.
For {ref}`sequence-like <formats-seq-like>` annotations,
this can be done by converting them to the `'generic-seq'` format.
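
To illustrate why a flat csv file is convenient to work with, here is a minimal sketch that counts how often each label occurs, using only Python's standard library. The rows and the column names (`label`, `onset_s`, `offset_s`, `notated_path`) are hypothetical stand-ins for illustration, not output copied from crowsetta. In practice you would more likely load the file with `pandas.read_csv`.

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for a flat annotations csv file;
# the column names are assumptions made for this example.
CSV_TEXT = """label,onset_s,offset_s,notated_path
a,0.0,0.5,song1.wav
b,0.6,1.1,song1.wav
a,0.2,0.7,song2.wav
"""


def count_labels(csv_file):
    """Count how many times each label occurs in a flat annotations csv."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


# io.StringIO stands in for an open file handle
label_counts = count_labels(io.StringIO(CSV_TEXT))
print(label_counts)
```

Because every annotated segment is one row, questions like "how many segments have each label?" reduce to a single pass over the file.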

This how-to walks you through converting
annotations to the `'generic-seq'` format and
then saving those annotations as a csv file.
As suggested by its name,
the `'generic-seq'` format is meant to be a generic sequence-like format
that all other sequence-like formats can be converted to.

## Workflow

Here's the general workflow. We'll see a few different ways to achieve it below.
1. Load annotations in your format
2. Convert those to {class}`crowsetta.Annotation` instances
3. Make a {class}`crowsetta.formats.seq.GenericSeq <crowsetta.formats.seq.generic.GenericSeq>`
from those `Annotation`s.
4. Save to a csv file using the
{meth}`crowsetta.formats.seq.generic.GenericSeq.to_file <crowsetta.formats.seq.generic.GenericSeq.to_file>`
method

This works because `crowsetta` represents a set of annotations in `generic-seq` format
as a list of {class}`crowsetta.Annotation` instances
where each `Annotation` has a {class}`crowsetta.Sequence`.
Since all sequence-like formats have a `to_annot`
method, they can all be converted to `'generic-seq'`.
In turn, this means that any sequence-like format
@@ -32,52 +54,48 @@ by creating a `'generic-seq'` instance with the
`Annotations` produced by calling `to_annot`
and then calling the `to_file` method of
the `'generic-seq'` instance.
Saving annotations to a single flat .csv file
may make it easier to share and
work with them
(e.g., using the {ref}`pandas <https://pandas.pydata.org/>` library).

## Converting a sequence-like format with multiple annotations per file

Some formats contain multiple annotations per file,
and the `to_annot` method of the corresponding class
will return multiple `crowsetta.Annotation` instances.
To convert this format to `'generic-seq'`,
just pass in those `Annotation`s when
creating an instance of `'generic-seq'`

```{code-cell} ipython3
import crowsetta
example = crowsetta.data.get('birdsong-recognition-dataset')
birdsongrec = crowsetta.formats.seq.BirdsongRec.from_file(example.annot_path)
annots = birdsongrec.to_annot()
print(
f"Number of annotation instances in example 'birdsong-recognition-dataset' file: {len(annots)}"
)
# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print(
f"Converted to 'generic-seq':\n{generic}"
)
```

## Converting a sequence-like format with a single annotation file per annotated file

When the convention for a format
is to have a one-to-one mapping
from annotated file to annotation file,
and we want to put multiple such annotations
into a single generic sequence file,
we need to go through an additional step.
That step consists of collecting all the annotations into a list.

For this example,
we use the same dataset we used in the {ref}`tutorial`,
The first example we show is for possibly the most common case,
where each annotated file has a single annotation file.
This is likely to be the case if you are using apps like Praat or Audacity.
An example of such a format is the Audacity
[standard label track format](https://manual.audacityteam.org/man/importing_and_exporting_labels.html#Standard_.28default.29_format),
exported to .txt files, that you would get if you were to annotate with
[region labels](https://manual.audacityteam.org/man/label_tracks.html#type).
This format is represented by the
{class}`crowsetta.formats.seq.AudTxt <crowsetta.formats.seq.audtxt.AudTxt>`
class in crowsetta.

As described above,
all you need to do is load your sequence-like annotations
with crowsetta,
and then call the `to_annot` method
to convert them to a {class}`crowsetta.Annotation` instance.
When working with a format
where there's one annotation file per annotated file,
this *does* mean you need to load **each** file
and convert it into a separate annotation instance.
(Below we'll see an example of a format
where annotations for multiple files
are contained in a single annotation file,
and so we only need to call `to_annot` once
after loading it to get a list of
{class}`crowsetta.Annotation`s.)
For this first example,
where we have multiple annotation files,
we use a loop to load each one and convert it to a
{class}`crowsetta.Annotation` instance.

We use the same dataset we used in the {ref}`tutorial` for this example,
["Labeled songs of domestic canary M1-2016-spring (Serinus canaria)"](https://zenodo.org/record/6521932)
by Giraudon et al., 2021,
annotated with {ref}`Audacity Labeltrack {aud-txt}` files.
annotated with {ref}`Audacity Labeltrack <aud-txt>` files.

```{code-cell} ipython3
cd ..
```

```{code-cell} ipython3
!curl --no-progress-meter -L 'https://zenodo.org/record/6521932/files/M1-2016-spring_audacity_annotations.zip?download=1' -o './data/M1-2016-spring_audacity_annotations.zip'
@@ -88,6 +106,8 @@ import shutil
shutil.unpack_archive('./data/M1-2016-spring_audacity_annotations.zip', './data/')
```

#TODO: show with scribe and then with class, explain difference

```{code-cell} ipython3
import pathlib
import crowsetta
@@ -97,16 +117,64 @@ audtxt_paths = sorted(pathlib.Path('./data/audacity-annotations').glob('*.txt'))
annots = []
for audtxt_path in audtxt_paths:
    annots.append(
        crowsetta.formats.seq.AudTxt.from_file(audtxt_path).to_annot
        crowsetta.formats.seq.AudTxt.from_file(audtxt_path).to_annot()
    )
print(
f"Number of annotation instances from Giraudon et al. 2021: {len(annots}"
f"Number of annotation instances from dataset: {len(annots)}"
)
```

```{code-cell} ipython3
# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()
```

```{code-cell} ipython3
print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()
```

## Converting a sequence-like format with multiple annotations per file

Some formats contain multiple annotations per file,
and the `to_annot` method of the corresponding class
will return multiple `crowsetta.Annotation` instances.
To convert this format to `'generic-seq'`,
just pass in those `Annotation`s when
creating an instance of `'generic-seq'`.

```{code-cell} ipython3
:tags: [hide-cell]
import crowsetta
crowsetta.data.extract_data_files()
```

```{code-cell} ipython3
import crowsetta
example = crowsetta.data.get('birdsong-recognition-dataset')
birdsongrec = crowsetta.formats.seq.BirdsongRec.from_file(example.annot_path)
annots = birdsongrec.to_annot()
print(
f"Converted to 'generic-seq':\n{generic}"
)
f"Number of annotation instances in example 'birdsong-recognition-dataset' file: {len(annots)}"
)
# pass in annots when creating generic-seq instance
generic = crowsetta.formats.seq.GenericSeq(annots=annots)
print("Created 'generic-seq' from annotations")
df = generic.to_df()
print("First five rows of annotations (converted to pandas.DataFrame)")
df.head()
```

```{code-cell} ipython3
print("Last five rows of annotations (converted to pandas.DataFrame)")
df.tail()
```