Cannot make newline mandatory for paragraph markers #53
Can you make a list of the failing test file links as a comment here? We can then point to these and get more information from UBS.
The following are the failing test cases
From the USFM-js test set
Hi @joelthe1 and @kavitharaju. Before commenting on this, let me just share my appreciation for your Calvin avatar, Joel. Best comics ever. :-) It sounds like you are indicating that you have some data from Paratext projects which does not pass your grammar tests if the tests expect a new line before paragraph markers. I see the list of tests, but it would help me to look at the USFM data which is not passing. Could you provide a couple of examples of USFM and a specific test which is failing?
Thank you @klassenjm 😄 They indeed are the best comics ever! Before I attempt to answer your questions, I believe some more context might be helpful here. In making this grammar, we created a test suite with USFM strings from the 'wild'. These mainly include strings we took from Paratext's test suite and usfm-js (unfoldingWord's USFM parser). Our grammar up until this point was passing for all of these. But the spec says that The links listed by @kavitharaju pinpoint the examples that are failing. They are mostly in single lines like here:
I also would love to hear your thoughts on validating such USFM. How strict should a validator be? Should we just flag these as warnings or fail them completely? Is there a line, in your mind, that divides the forgivable from the unbearable? :p
Hi @joelthe1 I started looking at more of the software in your repo, in the dev branch. I may be able to slowly understand your test process, but I'm not sure I have enough time on my hands. :-) Perhaps I could run the test suite on my end? I'm not sure if it's necessary. What might help me to engage is to know what you and @kavitharaju mean by 'failing'. In particular, are your tests identifying what specifically is non-conforming in the USFM samples? And are we sure it's the data and not the test? The one example you pointed me to was a much less common string of USFM from a study Bible. I did note that it ended with a character marker ( Looking at another simpler sample text... Line 987 in f087caf
There are To get back out of the weeds... :-) You were asking about strictness. That's a somewhat challenging question, I suppose. The guidance on syntax was written to help describe what a well-formed USFM document is. That would include having paragraph markers begin on a new line. But, technically, if you have a parser which knows what paragraph elements are, and perhaps you are converting USFM to JSON, XML, etc., then you might consider the newlines to be completely optional in terms of your goal. I think I would say that if you are validating for the purpose of verifying the well-formedness of a USFM document itself, as USFM, you would want to follow the syntax guide and be fairly strict, but relaxed about insignificant whitespace. If you are simply validating that there is unambiguous, parse-able USFM, you could be less strict (but dependent on a supplement like usfm.sty, or an alternative, to identify marker types). Does this help?
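The distinction drawn above, between verifying well-formedness and validating parse-ability, can be pictured as one check run in two modes. The following is only a rough Python sketch, not the project's grammar: `PARAGRAPH_MARKERS` is a hypothetical, deliberately tiny subset (a real tool would take marker types from the spec or from a stylesheet such as usfm.sty), and `inline_paragraph_markers` and `check` are illustrative names.

```python
import re

# Hypothetical subset of paragraph markers, for illustration only.
PARAGRAPH_MARKERS = {"p", "m", "q", "q1", "q2", "pi", "mi", "nb", "b"}

# A backslash followed by letters and optional digits, e.g. \p, \q1, \v.
_MARKER_RE = re.compile(r"\\([a-z]+\d*)")


def inline_paragraph_markers(usfm):
    """Return (offset, marker) for each paragraph marker that does not start a line."""
    found = []
    for match in _MARKER_RE.finditer(usfm):
        if match.group(1) not in PARAGRAPH_MARKERS:
            continue
        line_start = usfm.rfind("\n", 0, match.start()) + 1
        # Leading spaces or tabs are treated as insignificant whitespace.
        if usfm[line_start:match.start()].strip():
            found.append((match.start(), match.group(1)))
    return found


def check(usfm, strict):
    """Strict mode fails on inline paragraph markers; lenient mode only reports them."""
    issues = [
        f"offset {offset}: paragraph marker '{marker}' is not on a new line"
        for offset, marker in inline_paragraph_markers(usfm)
    ]
    ok = not (strict and issues)
    return ok, issues


sample = "\\c 1\n\\p\n\\v 1 In the beginning \\q1 a poetry line\n"
print(check(sample, strict=True))
print(check(sample, strict=False))
```

In this sketch the same finding is an error in strict mode and a warning in lenient mode, which is one way to read the "modes" idea without committing to either posture.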
Thanks for that detailed answer. I like how you differentiate between verifying well-formedness and validating parse-able USFM. That is helpful in thinking through modes for the parser. As for your question relating to the test cases, I really hope you don't need to run the tests to answer it (and I don't think you do). Pointing to the example you helpfully quoted:
This USFM string would be parsed successfully by the grammar (= test passes). This is parsing in the same sense as a programming language compiler would parse code, so a 'syntax' error would cause the parsing to fail and throw errors. The test cases are designed to report success if the grammar was able to parse without errors. This example worked because the grammar allowed for But re-reading the spec again, it came to our attention that it explicitly says that the Sorry this is belabored, but I hope it is clearer what we are asking :)
@joelthe1 It's clear what you're asking. In the original post, @kavitharaju said "Now making it mandatory leads to these test cases failing". My impression has been that you want to conform to the specification (or have your tests best able to detect non-compliant USFM) and so you were moving in this direction of making newlines mandatory. Our discussion highlights the issue. Again, I do think it comes down to the goal(s). If you want to be able to identify all paragraphs of content (the narrative structure) or all chapters or verses (bcv), and perhaps transform that into something else, then the grammar and tests could be looser in some aspects -- the USFM well-formedness is not key to doing that (although some syntax problems would be just wrong, like not having a space before a note caller, like But, in my opinion, the USFM specification is speaking to what a well-formed USFM document is. If we are producing USFM for storage and interchange (with unknown other tools / systems, some of which might just display it verbatim), I think we would want it to be well-formed. I imagine that almost anyone who is familiar with USFM, and who received a document which they viewed themselves in an editor, would be really puzzled to see paragraph markers floating inline. It would just be really unexpected. If only a software process was dealing with the file and knew about marker properties, no one would know. This is my attempt at painting the picture as I see it, based on my experience of working with and seeing USFM text for years. With these reflections as my perspective, I think that a document which wants to claim to be valid USFM should be well-formed. I don't know, but I suspect that there would be software developers who would not see the need for that perspective. The reality is that people still look at raw USFM regularly. Jeff
Thanks @klassenjm! That helps me to understand your perspective, which carries good weight in deciding the semantics and posture of this tool. We'll do some more thinking, factoring in your input, and decide on how to handle this. I am leaning towards having modes for the parser which would determine how strict it is. Either way, we hope to get your feedback on the tool and its associated grammar after we do some refactoring.
For this particular issue we are going with accepting parse-able USFM and allowing inline paragraph markers, as that would be the priority for users. Newlines would be placed appropriately before paragraph markers if the USFM is rebuilt through round-tripping (converting USFM to JSON, and JSON back to USFM).
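To illustrate the effect described above, here is a small Python sketch that moves inline paragraph markers onto their own lines. It works directly on the text and is only a stand-in for the result of a round trip; it is not the project's USFM-to-JSON conversion. `PARAGRAPH_MARKERS` is again a hypothetical subset and `put_paragraph_markers_on_new_lines` is an illustrative name.

```python
import re

# Hypothetical subset of paragraph markers, for illustration only.
PARAGRAPH_MARKERS = {"p", "m", "q", "q1", "q2", "pi", "mi", "nb", "b"}

# A backslash followed by letters and optional digits, e.g. \p, \q1, \v.
_MARKER_RE = re.compile(r"\\([a-z]+\d*)")


def put_paragraph_markers_on_new_lines(usfm):
    """Insert a newline before every paragraph marker that sits mid-line."""

    def fix(match):
        if match.group(1) not in PARAGRAPH_MARKERS:
            return match.group(0)
        # Offsets refer to the original string, which re.sub does not modify.
        line_start = usfm.rfind("\n", 0, match.start()) + 1
        if not usfm[line_start:match.start()].strip():
            return match.group(0)  # already at the start of a line
        return "\n" + match.group(0)

    rebuilt = _MARKER_RE.sub(fix, usfm)
    # Drop the spaces left dangling in front of the inserted newlines.
    return re.sub(r"[ \t]+\n", "\n", rebuilt)


loose = "\\p\n\\v 1 first line \\q1 second line \\q2 third line"
print(put_paragraph_markers_on_new_lines(loose))
```

Running the function on its own output changes nothing, so text that already follows the spec passes through untouched.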
As per the spec, all paragraph markers should come on a new line: https://ubsicap.github.io/usfm/about/syntax.html?highlight=newline#id5
But the rule in the grammar was relaxed to accommodate the examples in the test cases (from Paratext and usfm-js). Now making it mandatory leads to these test cases failing.
20 tests are failing in total. Most of them are from the Paratext test cases, a few are from usfm-js's set, and a few are from our basic tests.
@joelthe1 please recommend how to go about this.
Should we make the change and update these test cases to adhere to the spec?