Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cannot make newlLine mandatory for paragraph markers #53

Closed
kavitharaju opened this issue Apr 9, 2020 · 9 comments
Closed

cannot make newlLine mandatory for paragraph markers #53

kavitharaju opened this issue Apr 9, 2020 · 9 comments
Labels
question test This has to do with testing.
Milestone

Comments

@kavitharaju
Copy link
Collaborator

As per the spec all paragraph markers should come on a newline https://ubsicap.github.io/usfm/about/syntax.html?highlight=newline#id5

But the rule in the Grammar was relaxed to accomodate the example is test cases(from paratext and usfm-js). Now making it mandatory leads to these test cases failing.

20 tests are failing in total. Most of them are from paratext test cases and a few from usfm-js's set and a few from our basic tests.

@joelthe1 please recommend how to go about this.
Should we make the change and update these test cases to adhere to the spec?

@kavitharaju kavitharaju added question test This has to do with testing. labels Apr 9, 2020
@joelthe1
Copy link
Collaborator

joelthe1 commented Apr 9, 2020

Can you make a list of the failing test file links as a comment here. We can then point to these and get more information from UBS.

@kavitharaju
Copy link
Collaborator Author

The following are the failing test cases
In the True Positive test cases we added...

  1. it('Markers with attributes, links and extended content markers', function () {
  2. it('Milestone markers', function () {

    From the Paratext test cases
  3. it('NoErrorsEmptyBook', function () {
  4. it('NoErrorsPartiallyEmptyBook', function () {
  5. it('GlossaryNoKeywordErrors', function () {
  6. it('IgnoreKeywordErrorsOutsideOfGlossary', function () {
  7. it('GlossaryCitationForm_Pass', function () {
  8. it('GlossaryCitationFormMultipleWords_Pass', function () {
  9. it('GlossaryCitationFormEndsWithParentheses_Pass', function () {
  10. it('GlossaryCitationFormContainsComma_Pass', function () {
  11. it('GlossaryCitationFormContainingWordMedialPunctuation_Pass', function () {
  12. it('WordlistMarkerTextEndsInSpaceGlossaryEntryPresent_Pass', function () {
  13. it('WordlistMarkerKeywordEndsInSpaceGlossaryEntryPresent_Pass', function () {
  14. it('WordlistMarkerNestedProperNoun_Pass', function () {
  15. it('WordlistMarkerNestedTwoProperNouns_Pass', function () {
  16. it('WordlistMarkerWithKeyword_Pass', function () {
  17. it('WordlistMarkerNestedProperNounWithKeyword_Pass', function () {
  18. it('WordlistMarkerKeywordWithParentheses_Pass', function () {
  19. it('WordlistMarkerKeywordContainsComma_Pass', function () {

From USFM-js test set
20. https://github.com/Bridgeconn/usfm-grammar/blob/master/test/resources/usfm_js/pro_footnote.usfm

@klassenjm
Copy link

Hi @joelthe1 and @kavitharaju .

Before commenting on this, let me just share my appreciation for your Calvin avatar, Joel. Best comics ever. :-)

It sounds like you are indicating that: you have some data from Paratext projects which is not passing your grammar tests if your tests expect a new line before paragraph markers.

I see the list of tests, but it would help me to look at the USFM data which is not passing. Could you provide a couple of examples of USFM and a specific test which is failing?

@joelthe1
Copy link
Collaborator

joelthe1 commented Apr 14, 2020

Thank you @klassenjm 😄 They indeed are the best comics ever!

Before I attempt to answer your questions, I believe some more context might be helpful here. In making this grammar, we created a test suite with USFM strings from the 'wild'. These mainly include strings we took from Paratext' test suite and usfm-js (unfoldingWord's USFM parser).

Our grammar up until this point was passing for all of these. But the spec says that All paragraph markers should be preceded by a single newline.. In making this change in the grammar, a bunch of these test cases seem to fail. Now we haven't yet gone back to check if the other tools have updated their tests. But, regardless, we wanted your input before we make changes on our end.

The links listed by @kavitharaju pin-point the examples that are failing. They are mostly in single lines like here:

\\id MAT 41MATGNT92.SFM, Good News Translation, June 2003\n\\c 1\n\\p \\v 1 \n\\q1 “Someone is shouting in the desert,\n\\q2 ‘Prepare a road for the Lord;\n\\q2 make a straight path for him to travel!’ ”\n\\esb \\cat People\\cat*\n\\ms \\jmp |link-id="article-john_the_baptist"\\jmp*John the Baptist\n\\p John is sometimes called the last “Old Testament prophet” because of the warnings he brought about God\'s judgment and because he announced the coming of God\'s “Chosen One” (Messiah).\n\\esbe\n\\p \n\\v 2-6 From Abraham to King David, the following ancestors are listed: Abraham,...mother was \\jmp Ruth|link-href="#article-Ruth"\\jmp*), Jesse, and King David.\n\\w gracious|link-href="http://bibles.org/search/grace/eng-GNTD/all"\\w*

I also would love to hear your thoughts on validating such USFM. How strict should a validator be? Should we just point to these as warnings or fail them completely? Is there a line, in your mind, that divides the forgivable and unbearable?:p

@klassenjm
Copy link

klassenjm commented Apr 15, 2020

Hi @joelthe1

I started looking at more of the software in your repo, in the dev branch. I may be able to slowly understand your test process, but I'm not sure I have enough time on my hands. :-) Perhaps I could run the test suite on my end? I'm not sure if it's necessary

What might help me, to engage, is to know what you and @kavitharaju mean by 'failing'. In particular are your tests identifying what specifically is non-conforming in the USFM samples? And are we sure it's the data or the test?

The one example you pointed me to was a much less common string of USFM from a study Bible. I did note that it ended with a character marker (\w) which started after \n -- and that should be OK if your parser was treating the newline as a single space. But I'm not sure.

Looking at another simpler sample text...

it('WordlistMarkerWithKeyword_Pass', function () {

\\id GLO\r\n' +
    '\\c 1\n\\p \\v 1 something \\p  \\k keyword-keyw\\k* definition\r\n'
    '\\c 1\n\\p \\v 1 something \\p  \\w word|keyword\\w* definition\r\n' +
    '\\v 1 something \\p  \\w word|lemma=\"keyword\"\\w* definition'

There are \p within these lines which are not preceded by a newline. I saw in your grammar that the newline? was now relaxed. What I'm trying to share is that I'm not sure that I know enough about what 'fail' means without running the tests myself to see, to know what is at fault (data or test).

To get back out of the weeds... :-) You were asking about strictness. That's a somewhat challenging question, I suppose. The guidance on syntax was written to help describe what is a well-formed USFM document. It would include having paragraph markers begin with a new line. But, technically, if you have a parser which knows what paragraph elements are, and perhaps you are converting USFM to JSON, XML etc. then you might consider the newlines to be completely optional in terms of your goal. I think I would say -- that if you are validating for the purpose of verifying the well-formedness of a USFM document itself, as USFM, you would want to follow the syntax guide and be fairly strict - but relaxed about insignificant whitespace. If you are simply validating that there is unambiguous, parse-able USFM, you could be less strict (but dependent on a supplement like usfm.sty or alternative to identify marker types)

Does this help?

@joelthe1
Copy link
Collaborator

joelthe1 commented Apr 15, 2020

Thanks for that detailed answer. I like how you differentiate between verifying well-formedness vs validating parse-able USFM. That is helpful in thinking through modes for the parser.

As for your question relating to the test cases, I really hope (and don't think) you need to run the tests to answer the question. Pointing to the example you helpfully quoted:

    \\id GLO\r\n' +
    '\\c 1\n\\p \\v 1 something \\p  \\k keyword-keyw\\k* definition\r\n' +
    '\\c 1\n\\p \\v 1 something \\p  \\w word|keyword\\w* definition\r\n' +
    '\\v 1 something \\p  \\w word|lemma=\"keyword\"\\w* definition'

This USFM string would parse successfully by the grammar (= test passes). This is parsing in the same sense as a programming language compiler would parse code. So a 'syntax' error would cause the parsing to fail and throw errors. So the test cases are designed to report success if the grammar was able to parse without errors. This example worked because the grammar allowed for \p without newline preceding it.

But re-reading the spec again, it came to our attention that it explicitly says that the \p should be preceded with a newline. Now if we make the change in the grammar to require the newline before \p, the example above 'fails' to parse correctly because of the presence of \p without newline before them. Now we can certainly modify that example to make it parse correctly by the grammar. The question, however, is would you as, @klassenjm, call this example 'invalid' USFM? Or is the answer more nuanced and that it 'depends' on if you are trying to 'verify well-formedness' vs 'validate parse-able USFM'.

Sorry this is belabored but I hope it is clearer what we are asking :)

@klassenjm
Copy link

@joelthe1 It's clear what you're asking.

In the original post, @kavitharaju said "Now making it mandatory leads to these test cases failing". My impression has been that you want to conform to the specification (or have your tests best able to detect non-compliant USFM) and so you were moving in this direction - of making newlines mandatory.

Our discussion highlights the issue. Again, I do think it comes down to the goal(s). If you want to be able to identify all paragraphs of content (the narrative structure) or all chapters or verses (bcv), and perhaps transform that into something else, then the grammar and tests could be looser in some aspects -- the USFM well-formedness is not key to doing that (although some syntax problems would be just wrong, like not having a space before a note caller, like \f+ \fr 3:2 ... and would make a parser have a tougher time).

But, in my opinion, the USFM specification is speaking to what a well formed USFM document is. If we are producing USFM for storage and interchange (with unknown other tools / systems - some of which might just display it, verbatim), I think we would want it to be well-formed. I imagine that almost anyone who is familiar with USFM, and who received a document which they viewed themselves in an editor, would be really puzzled to see paragraph markers floating inline. It would just be really unexpected. If only a software process was dealing with the file and knew about marker properties -- no one would know.

This is my attempt at painting the picture, as I see it, and based on my experience of working with and seeing USFM text for years. With these reflections as my perspective, I think that a document which wants to claim to be valid USFM should be well-formed. I don't know, but I suspect that there would be software developers who would not see the need for that perspective. The reality is that people still look at raw USFM regularly.

Jeff

@joelthe1
Copy link
Collaborator

Thanks @klassenjm! That helps me to understand your perspective which carries good weight in deciding the semantics and posture of this tool.

We'll do some more thinking factoring in your inputs and decide on how to handle this. I am leaning towards having modes for the parser which would determine how strict it is. Either way we hope to get your feedback on the tool and its associated grammar after we do some refactoring.

@kavitharaju kavitharaju added this to the 2.0 version milestone May 21, 2020
@kavitharaju
Copy link
Collaborator Author

For this particular issue we are going with accepting a parse-able USFM and allowing inline paragraph markers, as that would be the priority for users.

The newLines would be used appropriately for paragraph markers if the USFM is re built using round tripping(converting USFM to JSON, and JSON back to USFM)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question test This has to do with testing.
Projects
None yet
Development

No branches or pull requests

3 participants