feat(cli): "und" is now added by default to -l list of languages
joanise committed Nov 23, 2021
1 parent 518d3d3 commit fd6189b
Showing 4 changed files with 56 additions and 19 deletions.
31 changes: 19 additions & 12 deletions docs/cli-guide.rst
@@ -69,10 +69,10 @@ code <https://en.wikipedia.org/wiki/ISO_639-3>`__ as an argument.
The languages supported by RAS can be listed by running ``readalongs prepare -h``
and they can also be found in the :ref:`cli-prepare` reference.

-So, a full command for a story in Algonquin, with a g2p fallback to
+So, a full command for a story in Algonquin, with an implicit g2p fallback to
Undetermined, would be something like:

-``readalongs prepare -l alq,und Studio/story.txt Studio/story.xml``
+``readalongs prepare -l alq Studio/story.txt Studio/story.xml``

The generated XML will be parsed into sentences. At this stage you can
edit the XML to have any modifications, such as adding ``do-not-align``
@@ -275,27 +275,34 @@ any element in the XML file:
<s xml:lang="eng" fallback-langs="fra,und">English mixed with français.</s>
These command line examples will set the language to ``fra``, with the g2p cascade
-falling back to ``eng`` and then ``und`` when needed:
+falling back to ``eng`` and then ``und`` (see below) when needed.

.. code-block:: bash
-readalongs prepare -l fra,eng,und myfile.txt myfile.xml
-readalongs align -l fra,eng,und myfile.txt myfile.wav output-dir
+readalongs prepare -l fra,eng myfile.txt myfile.xml
+readalongs align -l fra,eng myfile.txt myfile.wav output-dir
The "Undetermined" language code: und
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Notice that the two examples above use ``und`` as the last language in the
+Notice how the sample XML snippet above has ``und`` as the last language in the
cascade. ``und``, for Undetermined, is a special language mapping that
uses the definition of all characters in all alphabets that are part of the
Unicode standard, and
maps them as if the name of that character was how it is pronounced.
While crude, this mapping works surprisingly well for the purposes of
forced alignment, and allows ``readalongs align`` to successfully align
-most text with a few foreign words without any manual intervention. We
-recommend systematically using ``und`` at the end of the cascade. Note
-that adding other languages after ``und`` will have no effect, since
-the Undetermined mapping will map any string to valid ARPABET.
+most text with a few foreign words without any manual intervention.
+
+Since we recommend systematically using ``und`` at the end of the cascade, it
+is now added by default after the languages specified with the ``-l``
+switch to both ``readalongs align`` and ``readalongs prepare``. Note that
+adding other languages after ``und`` will have no effect, since the
+Undetermined mapping will map any string to valid ARPABET.
+
+In the unlikely event that you want to disable adding ``und``, add the hidden
+``--lang-no-append-und`` switch, or delete ``und`` from the ``fallback-langs``
+attribute in your XML input.

Debugging g2p mapping issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -318,7 +325,7 @@ The following series of commands:

::

-readalongs prepare -l l1,l2,und file.txt file.xml
+readalongs prepare -l l1,l2 file.txt file.xml
readalongs tokenize file.xml file.tokenized.xml
readalongs g2p file.tokenized.xml file.g2p.xml
readalongs align file.g2p.xml file.wav output
@@ -327,7 +334,7 @@ is equivalent to the single command:

::

-readalongs align -l l1,l2,und file.txt file.wav output
+readalongs align -l l1,l2 file.txt file.wav output

except that when running the pipeline as four separate commands, you can
edit the XML files between each step to make manual adjustments and
30 changes: 26 additions & 4 deletions readalongs/cli.py
@@ -201,6 +201,13 @@ def cli():
default=None,
help="OBSOLETE; the input format is now guessed by extension or contents",
)
+@click.option(
+"--lang-no-append-und",
+is_flag=True,
+default=False,
+hidden=True,
+help="Hidden option to disable the automatic appending of und (Undetermined) to -l",
+)
@click.option(
"-l",
"--language",
@@ -357,10 +364,13 @@ def align(**kwargs):
raise click.BadParameter(
"No input language specified for plain text input. Please provide the -l/--language switch."
)
+languages = kwargs["language"]
+if not kwargs["lang_no_append_und"] and "und" not in languages:
+languages.append("und")
plain_textfile = kwargs["textfile"]
_, xml_textfile = create_input_tei(
input_file_name=plain_textfile,
-text_languages=kwargs["language"],
+text_languages=languages,
save_temps=temp_base,
)
else:
@@ -404,6 +414,13 @@ def align(**kwargs):
@click.option(
"-f", "--force-overwrite", is_flag=True, help="Force overwrite output files"
)
+@click.option(
+"--lang-no-append-und",
+is_flag=True,
+default=False,
+hidden=True,
+help="Hidden option to disable the automatic appending of und (Undetermined) to -l",
+)
@click.option(
"-l",
"--language",
@@ -451,9 +468,13 @@ def prepare(**kwargs):
out_file = out_file[:-4]
out_file += ".xml"

+languages = kwargs["language"]
+if not kwargs["lang_no_append_und"] and "und" not in languages:
+languages.append("und")
+
if out_file == "-":
_, filename = create_input_tei(
-input_file_handle=input_file, text_languages=kwargs["language"],
+input_file_handle=input_file, text_languages=languages,
)
with io.open(filename, encoding="utf8") as f:
sys.stdout.write(f.read())
@@ -467,7 +488,7 @@ def prepare(**kwargs):

_, filename = create_input_tei(
input_file_handle=input_file,
-text_languages=kwargs["language"],
+text_languages=languages,
output_file=out_file,
)

@@ -573,7 +594,8 @@ def g2p(**kwargs):
ancestors in TOKFILE has the attribute "fallback-langs" containing a comma-
or colon-separated list of language codes. Provide multiple language codes to
"readalongs prepare" via its -l option to generate this attribute globally,
-or add it manually where needed.
+or add it manually where needed. Undetermined, "und", is automatically
+added at the end of the language list provided via -l.
With the g2p cascade, if a word cannot be mapped to valid ARPABET with the
language found in the "xml:lang" attribute, the languages in
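The appending rule that this commit adds to both `align` and `prepare` can be sketched as a standalone helper. This is an illustrative sketch only, not code from the commit: the function name `with_und_fallback` and its `no_append_und` parameter are hypothetical, and unlike the committed code it returns a new list instead of appending to the caller's list in place.

```python
from typing import List, Sequence


def with_und_fallback(languages: Sequence[str], no_append_und: bool = False) -> List[str]:
    """Return a copy of `languages` with "und" appended as the last fallback.

    Hypothetical helper mirroring the rule in the diff: "und" is added only
    when the opt-out flag is not set and "und" is not already in the list.
    """
    result = list(languages)  # copy, so the caller's list is left untouched
    if not no_append_und and "und" not in result:
        result.append("und")
    return result


if __name__ == "__main__":
    print(with_und_fallback(["fra", "eng"]))                      # default behaviour
    print(with_und_fallback(["fra", "eng"], no_append_und=True))  # opt-out
```

Returning a copy keeps the helper free of side effects, which makes it easy to call from both commands and to test in isolation.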
2 changes: 1 addition & 1 deletion test/data/fra-prepared.xml
@@ -3,7 +3,7 @@
<!-- To exclude any element from alignment, add the do-not-align="true" attribute to
it, e.g., <p do-not-align="true">...</p>, or
<s>Some text <foo do-not-align="true">do not align this</foo> more text</s> -->
-<text xml:lang="fra" fallback-langs="">
+<text xml:lang="fra" fallback-langs="und">
<body>
<div type="page">
<p>
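The change in this test file follows from how the `-l` list ends up in the `<text>` element: the first code becomes `xml:lang` and the remaining codes become `fallback-langs`. The sketch below illustrates that split under the assumption that the fallbacks are joined with commas, as in the documentation examples; the function name `tei_lang_attributes` is hypothetical and this is not the project's actual `create_input_tei` implementation.

```python
from typing import Dict, Sequence


def tei_lang_attributes(languages: Sequence[str]) -> Dict[str, str]:
    """Split an ordered language list into xml:lang and fallback-langs values.

    Illustrative only: assumes the first code is the main language and the
    rest are comma-joined fallbacks.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    main_lang, *fallbacks = languages
    return {"xml:lang": main_lang, "fallback-langs": ",".join(fallbacks)}


if __name__ == "__main__":
    # Before this commit, "-l fra" gave an empty fallback list.
    print(tei_lang_attributes(["fra"]))
    # After this commit, "und" is appended to the list first.
    print(tei_lang_attributes(["fra", "und"]))
```

With `["fra"]` the fallback attribute is empty, matching the old test data; with `["fra", "und"]` it is `und`, matching the updated file.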
12 changes: 10 additions & 2 deletions test/test_g2p_cli.py
@@ -78,7 +78,14 @@ def write_prepare_tokenize(self, text, lang, filename):
with open(filename + ".input.txt", "w", encoding="utf8") as f:
print(text, file=f)
self.runner.invoke(
-prepare, ["-l", lang, filename + ".input.txt", filename + ".prepared.xml"]
+prepare,
+[
+"-l",
+lang,
+"--lang-no-append-und",
+filename + ".input.txt",
+filename + ".prepared.xml",
+],
)
self.runner.invoke(tokenize, [filename + ".prepared.xml", filename])

@@ -196,7 +203,8 @@ def test_align_with_error(self):
pass
output_dir = os.path.join(self.tempdir, "aligned")
results = self.runner.invoke(
-align, ["-l", "eng", text_file, empty_wav, output_dir]
+align,
+["-l", "eng", text_file, empty_wav, output_dir, "--lang-no-append-und"],
)
if self.show_invoke_output:
print(