divvun-normaliser #44

Open
4 of 7 tasks
snomos opened this issue Mar 4, 2021 · 6 comments

@snomos
Member

snomos commented Mar 4, 2021

Draft specification here.

Tasks:

@snomos
Member Author

snomos commented Mar 4, 2021

The following works fine without divvun-normaliser:

echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin | cg-mwesplit 
"<Man>"
	"Man" N Prop Sem/Plc Sg Nom <W:0.0>
	"Man" N Prop Sem/Sur Sg Nom <W:0.0>
	"man" Adv <W:0.0>
	"mij" Pron Interr Sg Gen <W:0.0>
	"mij" Pron Interr Sg Ill Attr <W:0.0>
	"mij" Pron Interr Sg Ine Attr <W:0.0>
	"mij" Pron Rel Sg Gen <W:0.0>
	"mij" Pron Rel Sg Ill Attr <W:0.0>
	"mij" Pron Rel Sg Ine Attr <W:0.0>
: 
"<vuoras>"
	"vuoras" A Attr <W:0.0>
	"vuoras" A Sg Nom <W:0.0>
	"vuoras" Err/Orth A Attr <W:0.0>
	"vuoras" Err/Orth A Sg Nom <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
	":" CLB <W:0.0>
: 
"<23>"
	"23" A Arab Ord Attr CLBfinal <W:0.0>
	"23" Num Arab Sg Ela Attr <W:0.0>
	"23" Num Arab Sg Gen <W:0.0>
	"23" Num Arab Sg Ill Attr <W:0.0>
	"23" Num Arab Sg Ine Attr <W:0.0>
	"23" Num Arab Sg Nom <W:0.0>
	"23" Num Sem/ID <W:0.0>
:\n

But with divvun-normaliser I get a libdivvun error (and not the expected output format):

echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin \
| cg-mwesplit \
| divvun-normaliser -a src/analyser-gt-desc.hfst -n tools/tts/transcriptor-gt-desc.hfst -g src/generator-gt-norm.hfst 
libdivvun: ERROR: HfstException.
"<Man>"
: 
"<vuoras>"
"<:>"
: 
"<23>"
:\n

@flammie
Contributor

flammie commented Mar 5, 2021

It seems I didn't manage to set the default for the -t tags, so it printed nothing; now it should copy the input if no tags are set to be expanded.

@flammie
Contributor

flammie commented Mar 5, 2021

Pushed some more debugging output; it seems we need .hfstol files for lookup_fd:

echo 'Man vuoras: 23' | hfst-tokenise -g ~/github/giellalt/lang-smj/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g ~/github/giellalt/lang-smj/tools/tokenisers/mwe-dis.bin | cg-mwesplit | src/divvun-normaliser -a ~/github/giellalt/lang-smj/src/analyser-gt-desc.hfstol -n ~/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol -g ~/github/giellalt/lang-smj/src/generator-gt-norm.hfstol --tags Arab -v
libdivvun: ERROR: HfstException: Exception: NotTransducerStreamException: transducer type not recognised in file: HfstInputStream.cc on line: 1088
Read /home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol, /home/flammie/github/giellalt/lang-smj/src/generator-gt-norm.hfstol, /home/flammie/github/giellalt/lang-smj/src/analyser-gt-desc.hfstol
"<Man>"
	"Man" N Prop Sem/Plc Sg Nom <W:0.0>
	"Man" N Prop Sem/Sur Sg Nom <W:0.0>
	"man" Adv <W:0.0>
	"mij" Pron Interr Sg Gen <W:0.0>
	"mij" Pron Interr Sg Ill Attr <W:0.0>
	"mij" Pron Interr Sg Ine Attr <W:0.0>
	"mij" Pron Rel Sg Gen <W:0.0>
	"mij" Pron Rel Sg Ill Attr <W:0.0>
	"mij" Pron Rel Sg Ine Attr <W:0.0>
: 
"<vuoras>"
	"vuoras" A Attr <W:0.0>
	"vuoras" A Sg Nom <W:0.0>
	"vuoras" Err/Orth A Attr <W:0.0>
	"vuoras" Err/Orth A Sg Nom <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
	":" CLB <W:0.0>
: 
"<23>"
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" A Arab Ord Attr CLBfinal <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ela Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Gen <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ill Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ine Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Nom <W:0.0>
	"23" Num Sem/ID <W:0.0>
:\n

@snomos
Member Author

snomos commented Mar 5, 2021

Nice progress 🙂

@unhammer, are there any CG syntax restrictions on the transcribed string, "guaktalåkgålmmå"phon in the test case above? We modelled it on the divvun-cgspell output, but that one has only a single letter after the actual string. Just asking to avoid major changes later 🙂

@TinoDidriksen
Contributor

"guaktalåkgålmmå"phon is a valid CG tag, though it is not considered a textual tag - not that I think that matters for you. The rule is that if it starts with " then include anything to next " and from there include to next whitespace. This avoids much unnecessary escaping.

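To make the quoting rule above concrete, here is a minimal Python sketch of how a reading line could be split into tags under that rule. It is only an illustration of the rule as worded in this thread, not the actual vislcg3 parser: the function name `split_cg_tags` is hypothetical, and escaped quotes or other corner cases are deliberately ignored.

```python
def split_cg_tags(reading: str) -> list[str]:
    """Split one CG reading line into tags (simplified sketch, not vislcg3 code).

    Rule from the comment above: if a tag starts with a double quote,
    take everything up to the next double quote, and from there
    everything up to the next whitespace.
    """
    tags = []
    i, n = 0, len(reading)
    while i < n:
        # Skip whitespace between tags (including the leading tab).
        if reading[i].isspace():
            i += 1
            continue
        start = i
        if reading[i] == '"':
            # Jump past the closing quote, if there is one.
            close = reading.find('"', i + 1)
            if close != -1:
                i = close + 1
        # Consume the rest of the tag up to the next whitespace.
        while i < n and not reading[i].isspace():
            i += 1
        tags.append(reading[start:i])
    return tags


if __name__ == "__main__":
    line = '\t"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon'
    print(split_cg_tags(line))
```

Under this reading of the rule, the suffix after the closing quote can be any length, so "guaktalåkgålmmå"phon comes out as a single tag, and whitespace inside the quotes would not split it.
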
@snomos
Member Author

snomos commented Apr 14, 2022

A case we haven't considered: dynamic compounds, i.e. cohorts with sub-readings. There are two considerations:

  • we create subreadings out of the original - the normalized reading is the main reading, the original is stored in a subreading
  • in dynamic compounds, we may want to normalize each part separately, as in:
echo 1800-lågon | ./tools/tts/modes/smj-txt2ipa.mode 
"<1800-lågon>"
	"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
:\n

If we could normalize 1800- independently of the rest of the compound, we would solve a lot of corner cases.

Perhaps the best solution would be to not change the basic cohort structure at all, i.e. we do NOT add the original lemma as a subreading. Instead I suggest that we store the original in a tag string along the lines of the "abc"phon string, something like "1800-"orig or "1800-"olemma. The main purpose of retaining the original lemma is debugging, and changing the cohort structure seems to cost too much.
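
As a rough illustration of the tag-based alternative, the Python sketch below appends the original string to a reading line as a quoted tag instead of creating a subreading. Everything here is an assumption for discussion: the helper `add_orig_tag` is hypothetical, the suffix `orig` is just one of the names floated above, and this is not how divvun-normaliser is implemented.

```python
def add_orig_tag(reading: str, original: str, suffix: str = "orig") -> str:
    """Append the original form as a quoted tag, e.g. "1800-"orig.

    The reading's indentation is left untouched, so the cohort structure
    (main readings vs. subreadings) stays exactly as it was.
    Assumes `original` contains no double quote.
    """
    return f'{reading.rstrip()} "{original}"{suffix}'


if __name__ == "__main__":
    reading = '\t\t"1800" Num Cmp/Hyph Cmp <W:0.0>'
    print(add_orig_tag(reading, "1800-"))
```

Because only a tag is appended, each part of a dynamic compound could in principle carry its own original string without any change to the nesting of readings.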

@flammie could you have a look at this? I added the new tasks to the task list in the initial comment.
