-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question: is the grammar 'round-trippable'? #3
Comments
Thank you! I am quite happy with comments and remarks, it really helps to make things better. I think you refer to a property of the parser rather than the grammar. Specifically:
Am I right? The short answer is no, this property does not hold. The parse function is lossy: the use of
The print function is also lossy. It does an additional percent-encoding of certain code points. The property does hold for strict (i.e. valid) URLs I believe.
Do you think it would be a good idea to add some notes about this? A somewhat longer answer: This is a nice question and actually the specification is not quite exact in this area, as it leaves the construction of the parse tree implicit. Basically, I leave it to the reader to infer the parse function from the grammar, and figure out what needs to be preserved and/or dropped by looking at the model. I'd be quite happy to hear ideas about how to make things more explicit and clear. The common approach would be to specify for each production rule, its 'semantics', which is just a function. I think I need such a thing anyway to specify the interpretation IP-addressess. But I haven't figured out how to concretely write that down yet. |
Yep, I'd agree the ability to reconstruct the original input reliably would be a property of a parser. My original issue description was a bit semantically muddled, and described reconstructing the original input unambiguously -- perhaps what I should have said is something more along the lines of (borrowing and extending your notation):
Thanks; that makes sense.
After URL normalization, it seems like that should be true - there might be some gaps prior to resolution though? Multiple dot-segment references, such as
Potentially - although making sure those notes would be relevant and valuable to readers both seem important. Did you have anything in mind?
I'll try, although it's near the limits of my understanding and my sense is that other people have a better knowledge of parsing, grammars and web standards. Further discussion (if it seems like a useful topic to address) is probably a good start. |
I see. This does hold, also for invalid url-strings. The WHATWG Standard actually has this as an explicitly stated goal, however, their parse function also normalises. So that would become:
I expose more of the structure of an URL than they do, so it's a stronger claim for me to make, but not doing so leads to issues for them as well (whatwg/url#574 is an example) so I hope the corner cases can be resolved, then I can make this claim as well. I suppose it would be good to mention such properties in my specification. Maybe I can add tests for it to the reference implementation too. There's a bunch of other gaps still that I need to fill in though.
You might be better than you think! Thinking that you don't know something can be a sign that you already have an intuitive understanding of the problems that need to be solved. That is a big thing. I created issue #4 for any potential discussions about specifying the parse tree and parse function more rigorously. |
That makes sense, thanks -- essentially, apply a one-way process to reach the 'normal form', and from that point, we can parse/print (similar to deserialize/serialize) that form reliably.
Noted, I'll read up on that issue.
Thanks :) I'm certainly glad to understand and learn some of the patterns used to formalize specifications, and to some extent also keen to raise awareness of where I think there may be gaps or opportunities.
For housekeeping purposes I'm considering closing a few issues I've opened and think are 'resolved' (you've covered the questions I had here, and more). Fortunately I don't think that prevents any further discussion should anyone on GitHub want to continue the conversation here. |
I don't mind leaving issues open as a discussion thread either. If there would be a lot more activity then I'd have a better look at process, but I'm not too concerned about that now. Now that I'm thinking about it, I might have a look at the discussions feature of GitHub. I think it is nice if there is a spot for questions rather than issues. |
I think this is good summary of what was talked about here (in case anyone finds it useful), considering
There are some corollaries, most of which trivial, but not necessarily obvious, and still interesting:
In English:
Also note that, interestingly:
Noting, of course, that this is all from what I was able to gather and understand. If there is anything wrong with any of this, feel free to let me know and I can update this post! |
Very cool! |
@alwinb: I am unsure whether I can edit the wiki. Maybe I am just missing something, but you might have to tick off the “Restrict editing to collaborators only” in the repository settings. |
Thanks, can you edit it now? |
Yeah, thanks! 👍 🎉 I made some miscellaneous formatting changes to test it out, and also added a link to the “Discussions” tab. Hopefully my changes seem fine enough! |
I should probably include these in the specification page itself. And re-check them with each change. |
(I'm fairly naive to specifications and grammars, so please don't allow my questions / comments to slow down or distract from progress - it's been great to read this code and spec)
From reading a bit about formal grammars recently I learned about the 'round-trippable' property of some grammar specifications. Basically the idea is that you can parse an input, and then turn the parse tree back into the original input, unambiguously.
Although it might not be the first concern that springs to mind for URL grammars, I think that if it is possible to use a round-trippable grammar, it could eliminate an entire class of future information/communication-loss issues (and potentially even a few security issues as a result) that could spring up in actual implementations.
The text was updated successfully, but these errors were encountered: