Normalize each CG sub-reading separately, like phonemisation #58

Open
Tracked by #44
snomos opened this issue Sep 12, 2023 · 13 comments

@snomos
Member

snomos commented Sep 12, 2023

Cf #44 (comment)

See also the following example:

Ulmme lij gehtjadit gåktu 25 jahkebuolva 15-jagágij lidjin jåhtålam 23 sáme-vuona rabdaguovlo suohkanijs.

where the 15 in 15-jagágij is not transcribed.

@snomos snomos mentioned this issue Sep 12, 2023
7 tasks
@snomos snomos changed the title add support for doing normalization of single main/sub readings in dynamic compounds, like we do for phonemisation Normalize each CG sub-reading separately, like phonemisation Sep 12, 2023
flammie added a commit that referenced this issue Sep 12, 2023
@flammie
Contributor

flammie commented Sep 12, 2023

Currently the transcriptor is set up to look up the nearest surface form. For sub-readings that have no surface-form tag or similar tag, it falls back to the whole token 15-jagágij, which is not in the transcriptor. Using the lemma might make more sense for transcription, though.
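
A minimal sketch of the lookup order described above, not the actual divvun-normalise code. The tag suffix `SURFACE` is a hypothetical placeholder for whatever surface-form tag the pipeline emits, and `lookup_key` is an illustrative name. It prefers a surface-form tag, then the lemma of the (sub-)reading, and only then the whole wordform:

```python
import re

# Quoted tags with a suffix, e.g. "15"oldlemma. The opening quote must be
# preceded by whitespace (or start of line) so that the closing quote of the
# lemma is never mistaken for the start of a tag.
QUOTED_TAG = re.compile(r'(?<!\S)"([^"]*)"([A-Za-z]+)')
# The lemma is the first quoted string on a reading line.
LEMMA = re.compile(r'^\s*"([^"]*)"')


def lookup_key(reading, wordform, surface_suffix="SURFACE"):
    """Pick the string to look up for one reading or sub-reading.

    surface_suffix is an assumed tag name, not taken from the thread.
    """
    for value, suffix in QUOTED_TAG.findall(reading):
        if suffix == surface_suffix:
            return value          # 1. explicit surface-form tag
    match = LEMMA.match(reading)
    if match:
        return match.group(1)     # 2. lemma of this (sub-)reading
    return wordform               # 3. last resort: the whole token
```

With the sub-reading `"15" Num Cmp/Hyph Cmp` this yields `15` rather than the full token, which is the behaviour suggested in the last sentence above.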

@flammie
Contributor

flammie commented Sep 12, 2023

I see the other bug now. Yeah, it would possibly be much easier not to mess with more sub-readings here...

@flammie
Contributor

flammie commented Sep 12, 2023

"<15-jagágij>"
	"jahke" Ex/N Sem/Time Der/k A <smj> Pl Com <W:0.0> @<ADVL
		"lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma
	"jahke" Ex/N Sem/Time Der/k A <smj> Sg Ill <W:0.0> @<ADVL
		"lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma

This is the current output after normalise.
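
To make the structure of that output explicit, here is a rough sketch of how a cohort like the one above could be split into main readings and their sub-readings. It assumes the usual CG stream layout seen in this thread (wordform line starting with `"<`, main readings indented by one tab, sub-readings by one more tab per level); `parse_cohort` is an illustrative name, not part of any tool mentioned here.

```python
def parse_cohort(text):
    """Split one CG cohort into its wordform and a list of reading chains.

    Each chain is a list of lines: the main reading first, followed by its
    sub-readings in the order they appear in the stream.
    """
    wordform = None
    chains = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith('"<'):
            # '"<15-jagágij>"' -> '15-jagágij'
            wordform = line.strip()[2:-2]
            continue
        depth = len(line) - len(line.lstrip("\t"))
        if depth == 1:
            chains.append([line.strip()])      # new main reading
        elif depth > 1 and chains:
            chains[-1].append(line.strip())    # sub-reading of the last main
    return wordform, chains
```

Normalising each sub-reading separately then amounts to looping over every line of every chain instead of only over the wordform.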

@snomos
Member Author

snomos commented Sep 12, 2023

Looks good to me. What do you think, @ilm024 ?

what would be the full compound output?

@snomos
Member Author

snomos commented Sep 13, 2023

With newest divvun-normalise I get the following:

"<15-jagágij>"
	"jahke" Ex/N Sem/Time Der/k A Pl Com "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
		"15" Num Cmp/Hyph Cmp "15-#»jagág9>ij"MIDTAPE <W:0.0> #7->3
	"jahke" Ex/N Sem/Time Der/k A Sg Ill "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
		"15" Num Cmp/Hyph Cmp "15-#»jagág9>ij"MIDTAPE <W:0.0> #7->3

What is missing to get what you get?

@flammie
Contributor

flammie commented Sep 13, 2023

Probably version differences. The midtapes would confuse the normalise lookup, and I don't get them with my hfst as it is now. So the output of smj-normaliser6-cg.mode is just:

"<15-jagágij>"
	"jahke" Ex/N Sem/Time Der/k A Pl Com <W:0.0> @<ADVL #7->3
		"15" Num Cmp/Hyph Cmp <W:0.0> #7->3
	"jahke" Ex/N Sem/Time Der/k A Sg Ill <W:0.0> @<ADVL #7->3
		"15" Num Cmp/Hyph Cmp <W:0.0> #7->3
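
One way to keep MIDTAPE strings from interfering with a lookup while still retaining them is to lift the tag off the reading first and keep it next to the stripped line. The following is only a sketch under that assumption; `split_midtape` is a made-up helper name, and it assumes the MIDTAPE value contains no whitespace, as in the examples in this thread.

```python
import re

# A whitespace-delimited quoted string immediately followed by MIDTAPE.
MIDTAPE = re.compile(r'\s+"([^"\s]*)"MIDTAPE\b')


def split_midtape(reading):
    """Return (reading without its MIDTAPE tag, MIDTAPE value or None)."""
    match = MIDTAPE.search(reading)
    if match is None:
        return reading, None
    stripped = reading[:match.start()] + reading[match.end():]
    return stripped, match.group(1)
```

Applied to a reading that carries a MIDTAPE tag, this gives a line of the same shape as the output above, plus the midtape string, which can be stored for a later step.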

@snomos
Member Author

snomos commented Sep 13, 2023

Ok. What is the input and the command you used to get the desired output?

@flammie
Contributor

flammie commented Sep 13, 2023

E.g.

echo Ulmme lij gehtjadit gåktu 25 jahkebuolva 15-jagágij lidjin jåhtålam 23 sáme-vuona rabdaguovlo suohkanijs | $GTLANGS/lang-smj/tools/tts/modes/smj-normaliser6-cg.mode

etc. I'm not sure why I don't get midtapes. Debugging with --verbose, like:

echo 15-jagágij | ~/github/hfst/hfst/tools/src/hfst-tokenize -g '/home/flammie/github/giellalt/lang-smj/tools/tokenisers/tokeniser-tts-cggt-desc.pmhfst.tmp' -v

just shows no results for lookups on midtapes.

@flammie
Contributor

flammie commented Sep 13, 2023

I commented out the midtape reading; I'm not sure whether it made sense in the normalising step or was a copy-paste from the phonemiser.

@snomos
Member Author

snomos commented Sep 13, 2023

I am not sure either whether we need midtape in the normaliser process, but we definitely need to retain midtape strings for later IPA conversion.

IIRC the idea was to have an option for "deep analysis" that would generate the midtape stuff for normalised input.

@flammie
Contributor

flammie commented Sep 14, 2023

Well, midtape is kind of retained now if it gets used by phon:

"<15-jagágij>"
	"jahke" Ex/N Sem/Time Der/k A Pl Com "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
		"lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma
	"jahke" Ex/N Sem/Time Der/k A Sg Ill "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
		"lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma
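
For illustration, a later step could read the quoted tags (MIDTAPE, phon, oldlemma) off each line of such output roughly as follows. This is a sketch only; `quoted_tags` is a hypothetical helper and not something the normaliser or phonemiser is known to provide.

```python
import re

# Quoted string directly followed by a tag suffix, e.g. "15"oldlemma.
# The opening quote must follow whitespace or start the line, so the
# closing quote of the lemma cannot start a false match.
QUOTED_TAG = re.compile(r'(?<!\S)"([^"]*)"([A-Za-z]+)')


def quoted_tags(reading):
    """Map tag suffix -> quoted value for one reading or sub-reading line."""
    return {suffix: value for value, suffix in QUOTED_TAG.findall(reading)}
```

On the main reading above this picks up the MIDTAPE string, and on the sub-reading it picks up both the normalised form under phon and the original 15 under oldlemma.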

@snomos
Member Author

snomos commented Sep 14, 2023

ok, good 🙂

@snomos
Member Author

snomos commented Sep 14, 2023

we probably have to use the deep analyser thing to get a full MIDTAPE representation, if we need that
