
cg-mwesplit adds extra newline #134

Open
snomos opened this issue Nov 17, 2023 · 2 comments

Comments


snomos commented Nov 17, 2023

Compare the following (using giellalt/lang-sme as an example):

echo 'Jođiheaddji guovttosges' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst 
"<Jođiheaddji guovttosges>"
	"ges" Pcle Foc/ges <W:0.0> "<ges>"
		"jođiheaddji guovttos" N Coll Sem/Group_Hum Sg Loc <W:0.0> "<Jođiheaddji guovttos>"
	"ges" Pcle Foc/ges <W:0.0> "<ges>"
		"jođiheaddji guovttos" N Coll Sem/Group_Hum Sg Nom <W:0.0> "<Jođiheaddji guovttos>"
:\n
echo 'Jođiheaddji guovttosges' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | cg-mwesplit
"<Jođiheaddji guovttos>"
	"jođiheaddji guovttos" N Coll Sem/Group_Hum Sg Loc <W:0.0>
	"jođiheaddji guovttos" N Coll Sem/Group_Hum Sg Nom <W:0.0>
"<ges>"
	"ges" Pcle Foc/ges <W:0.0>
:\n

After cg-mwesplit has been applied, there is an extra newline after the split cohorts that was not present in the input. Do you get the same, @unhammer?

Collaborator

unhammer commented Nov 18, 2023

Yes – this also happens with plain vislcg3 (which typically runs before cg-mwesplit; they use the same underlying CG stream processing code):

$ echo 'makkár nu' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | grep -c '^$'
0
$ echo 'makkár nu' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | vislcg3 -g mwe-dis.cg3 | grep -c '^$'
1

But where does it matter? (Don't all the plugins use the json output format?)
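
For anyone who wants to try the blank-line count above without the giellalt tools installed, here is a minimal sketch. It assumes a POSIX shell with grep available. The helper names `count_blank_lines` and `sample_stream` are hypothetical, and the sample data is a hand-copied fragment of the output quoted in this issue, not the output of a real hfst-tokenise, vislcg3 or cg-mwesplit run.

```shell
# Hypothetical helper: count empty lines on stdin, like `grep -c '^$'` above.
count_blank_lines() {
    # grep -c still prints 0 when nothing matches, but exits with status 1;
    # `|| true` keeps the helper usable under `set -e`.
    grep -c '^$' || true
}

# Hypothetical sample: one cohort plus the ":\n" line, copied from this issue.
sample_stream() {
    printf '"<ges>"\n\t"ges" Pcle Foc/ges <W:0.0>\n:\\n\n'
}

# Stream as the tokeniser emits it (no empty line).
before=$(sample_stream | count_blank_lines)

# Same stream with one empty line appended, imitating the reported behaviour.
after=$({ sample_stream; printf '\n'; } | count_blank_lines)

echo "blank lines before: $before, after: $after"
```

Replacing `sample_stream` with the real pipelines should give the same kind of before/after comparison.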

Author

snomos commented Jan 23, 2024

It just feels "dirty": the stream is changed in unintended ways. There was also a use case I had in mind when I reported this, but that was a long time ago and I have since forgotten it. I will add it if and when I remember what it was.
