Parser testing is incredibly inconvenient using Test262 #1356

Closed
dead-claudia opened this issue Nov 23, 2017 · 19 comments · Fixed by #1375

Comments

@dead-claudia

dead-claudia commented Nov 23, 2017

I was going through the tests to adapt them for parsers (since test262-parser-tests is hardly maintained, much less kept up to date), and I found a couple of major issues:

Some tests aren't actually testing what they're supposed to

Here are a few examples:

  • This test doesn't actually test functions with Unicode escapes in their names, because the escape is resolved inside the string literal itself, leaving a literal A before eval ever sees the source. So what gets evaled is function __funcA(__arg){return __arg;}; __funcA, not the expected function __func\u0041(__arg){return __arg;}; __funcA (see the sketch after this list). It also has the issue below.
  • These six tests (1, 2, 3, 4, 5, 6) have other grammatical errors that are irrelevant to what's being tested.
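For concreteness, here's a minimal illustration (not an actual Test262 file) of why the escape never reaches the parser:

```js
// The \u0041 escape is resolved by the string literal itself, so the source
// handed to eval already contains a plain "A":
var src = "function __func\u0041(__arg){return __arg;}; __funcA";
src === "function __funcA(__arg){return __arg;}; __funcA"; // true

// To exercise the escape in the *parsed* source, it would have to be written
// with a doubled backslash (or, better, as raw characters in a non-eval test):
var escaped = "function __func\\u0041(__arg){return __arg;}; __funcA";
escaped.includes("\\u0041"); // true; eval(escaped) actually parses the escape
```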

A ton of tests are using eval unnecessarily

I've found literally over 300 instances where I had to replace a test with another because eval got in the way. With few exceptions (like these three), the vast majority of them could be refactored to check for an early error without meaningfully changing the test. As for how this could be achieved, I did find two good examples of what would be helpful for parsers; those two I don't have to preprocess and replace, unlike every single other example here.


What would I like to see happen?

Here are a few:

  1. Don't use eval unless you have to. Any time you use eval, it's inaccessible to parsers that only concern themselves with static semantics. This includes files like this, where you can always use the raw characters themselves.
  2. List eval in features in the frontmatter whenever you use it. This doesn't provide much direct benefit to parsers, but it could definitely provide benefit to engines that don't support dynamic evaluation. (Kinoma XS6 is one example of this, but I suspect there may be others.) Edit: This is probably the weakest among them, since it doesn't directly affect me.
  3. For each test that checks for static semantics (like the exceptions I noted above), if you must use eval, provide both an eval-based version and a static fallback (a rough sketch of what that might look like follows this list).
  4. As do expressions appear, assuming they still return completion values, start substituting those for equivalent eval versions as applicable. Parsers will understand the former, but they can't process the latter.
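To make suggestions 1 and 3 concrete, here's a rough sketch of the kind of refactor I mean, using a hypothetical duplicate-let early error. The frontmatter follows Test262's negative-test convention, though the exact field values have varied over the project's history, so treat this as illustrative only:

```js
// Hypothetical eval-based form, the style many legacy tests use
// (assert.throws comes from Test262's harness):
assert.throws(SyntaxError, function() {
  eval("let x; let x;");
});
```

and the static counterpart, as its own file that a parser can consume directly:

```js
// Hypothetical static form (harness boilerplate for preventing evaluation omitted):
/*---
negative:
  phase: parse
  type: SyntaxError
---*/
let x; let x;
```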

This was for a pretty simple script I wrote, but the workaround took about 3 days' worth of work, so I'm obviously not happy with how this turned out...

@bakkot
Contributor

bakkot commented Nov 23, 2017

For what it's worth, I do try to maintain test262-parser-tests, but I haven't been porting tests from test262 proper (and am still undecided if and how to go about that), so it does tend to have less coverage, especially of new features.


I agree that it would be nice to not use eval in tests which test something else. In a lot of cases it would be enough to just split files into an eval and non-eval version, I think.


List eval in features in the frontmatter whenever you use it. This doesn't provide much direct benefit to parsers, but it could definitely provide benefit to engines that don't support dynamic evaluation.

I don't know if this is worth it. I maintain a project which amounts to an implementation not supporting eval (internally, not something public), but I find it straightforward to blacklist test262 tests based on the presence of calls to eval rather than checking flags: the presence of eval doesn't confuse parsers in the way that the presence of new language features does, so it's easy to rely on the AST instead of the frontmatter.
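For illustration, here is a minimal sketch of that kind of AST-based filter, assuming an ESTree-producing parser such as acorn (not the actual tooling in question):

```js
// Sketch only: flag any file that references dynamic evaluation.
const acorn = require("acorn");
const walk = require("acorn-walk");

function usesDynamicEvaluation(source) {
  const ast = acorn.parse(source, { ecmaVersion: 2018 });
  let found = false;
  walk.simple(ast, {
    // Any reference to `eval` covers direct calls, indirect calls like
    // (0, eval)(...), and aliasing such as `var e = eval`.
    Identifier(node) {
      if (node.name === "eval") found = true;
    },
    // The Function constructor is the other common dynamic-evaluation entry point.
    CallExpression(node) {
      if (node.callee.type === "Identifier" && node.callee.name === "Function") found = true;
    },
    NewExpression(node) {
      if (node.callee.type === "Identifier" && node.callee.name === "Function") found = true;
    },
  });
  return found;
}
```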

On the other hand, if (as I think *should* be the case, whether or not it is) every test containing a call to eval parses successfully, it should be pretty easy to automatically add that flag. If there really is significant benefit to consumers I'd be happy to do that myself, if no one else wanted to.

@dead-claudia
Author

@bakkot

For what it's worth, I do try to maintain test262-parser-tests

You could probably find a way to use a variant of what I have (which generally works, but it's a friend who's been primarily using it).

I'll also note that its highly Shift-specific nature makes it much more difficult to use for ESTree-based parsers.

I don't know if this is worth it. [stuff about eval in features...]

This is probably among the weakest of mine, and when I initially thought about it (after typing it out), I almost removed it before filing the issue, since it didn't really affect parsers so much.

On the other hand, if (as I think *should* be the case, whether or not it is) every test containing a call to eval parses successfully, it should be pretty easy to automatically add that flag.

I was thinking that, too, except you'll run into issues with indirect eval. It'd probably also be wise to consider including the Function constructor (although that's a bit more easily grep'd), and to make the flag more generic, like "includes dynamic evaluation", rather than something specific to eval.

@bakkot
Contributor

bakkot commented Nov 23, 2017

I'll also note that its highly Shift-specific nature makes it much more difficult to use for ESTree-based parsers.

Say more? I've run it against other parsers without problems; what's the Shift-specific nature you ran into? (I guess this might be the wrong place for that discussion; please feel free to ping me on IRC or open an issue there if you'd rather not continue here.)

@dead-claudia
Author

dead-claudia commented Nov 24, 2017

Say more? I've run it against other parsers without problems; what's the Shift-specific nature you ran into?

More just the fact that Shift != ESTree, and that the two AST specs differ substantially in structure. If you want to find me on Gitter, I can elaborate a little more on this if you'd like.

@littledan
Member

Would it be worth revisiting the possibility of merging test262-parser-tests into test262? I'd be curious about any problems with test262-parser-tests too; it would probably be useful to explain here for others.

@dead-claudia
Author

dead-claudia commented Nov 24, 2017 via email

@littledan
Member

I imagine it's hard to validate ASTs without picking one, which could be a lot of harness work to adapt to different projects with different ASTs. Maybe "syntax error or not" is a good start.

@jugglinmike
Contributor

Thanks for the detailed report, Isiah! I'm reading some frustration in the tone of your prose. That may be a fabrication on my end, but I can appreciate if you're feeling discouraged. Test262 wasn't created with parsers in mind, and you've had to put in a lot of extra work in order to benefit from those extra 300 tests.

Over the past several years, we've been working to make up for this deficiency (see gh-196, gh-360, gh-382, gh-542, gh-655, gh-778, gh-1254), so the fact that we are able to discuss improvements like these at all is reflective of a fair amount of effort. We clearly still have some work ahead of us, but you can take this as a sure sign that the project maintainers are interested. It might also be encouraging for you to know that all of the troubled tests are named according to the "legacy" naming scheme. I point this out to demonstrate that the issues you've identified are technical debt and don't reflect the current standards for new contributions.

Regarding your recommendations:

  2. List eval in features in the frontmatter whenever you use it. This doesn't provide much direct benefit to parsers, but it could definitely provide benefit to engines that don't support dynamic evaluation. (Kinoma XS6 is one example of this, but I suspect there may be others.)

I am usually the first to argue in favor of "more metadata," but I've come to appreciate the maintenance burden that can carry. This isn't to say that we shouldn't introduce an eval feature, but we should identify a real user/use case before taking on that requirement.

  4. As do expressions appear, assuming they still return completion values, start substituting those for equivalent eval versions as applicable. Parsers will understand the former, but they can't process the latter.

This could help, but it is a very long-term solution. As of today, the latest attempt at do expressions is at stage 1 in the ECMAScript proposal process. We'll need to have a high degree of confidence in that feature's implementation before relying on its semantics in unrelated tests--otherwise, we'll effectively be decreasing coverage for many consumers. We could take a hybrid approach by exposing two versions of the tests in question (one using do expressions for the benefit of parsers, and one using eval for the benefit of more conservative implementations), but even this will have to wait a while.

  1. Don't use eval unless you have to. Any time you use eval, it's inaccessible to parsers that only concern themselves with static semantics. This includes files like this, where you can always use the raw characters themselves.

[...]

  3. For each test that checks for static semantics (like the exceptions I noted above), if you must use eval, provide both an eval-based version and a static fallback.

We're already on-board with you there. As noted above, I consider the tests you've identified as technical debt. Your script may come in handy here. I'm sure there are other things you would have rather spent those three days doing, but you might be able to wring out some more value by using it to update the tests. What do you say?

@bakkot
Contributor

bakkot commented Nov 26, 2017

I imagine it's hard to validate ASTs without picking one

The best idea I've had for this is to provide "maximally obvious" versions of all syntactically correct files alongside the originals, so that parsers can assert that the "obvious" version parses to "the same" AST, modulo location differences, as the primary version (where the terms in quotes are not particularly well defined). That's what test262-parser-tests currently does.
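As a rough sketch of that comparison (not the actual test262-parser-tests harness), the idea is just to strip location data and deep-compare the two parses:

```js
// Sketch only: compare two parses after removing location information.
const assert = require("assert");

function stripLocations(node) {
  if (Array.isArray(node)) return node.map(stripLocations);
  if (node && typeof node === "object") {
    const copy = {};
    for (const key of Object.keys(node)) {
      if (key === "loc" || key === "range" || key === "start" || key === "end") continue;
      copy[key] = stripLocations(node[key]);
    }
    return copy;
  }
  return node;
}

// `parse` is whatever parser the consumer wants to exercise.
function assertSameAstModuloLocations(parse, primarySource, explicitSource) {
  assert.deepStrictEqual(
    stripLocations(parse(primarySource)),
    stripLocations(parse(explicitSource))
  );
}
```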

@littledan
Member

@jugglinmike I appreciate your detailed response and all of your work on this issue. I agree with you in not immediately taking up most of the initial suggestions, and look forward to your continued work in adding more parser tests and facilitating adding more syntax error tests. I'm actually fine with leaving in complex, valid tests, but simpler tests could be a good addition as well.

Previously, I thought there was an idea that, for each syntax error test, we'd have to be able to identify the line of spec that banned it (is this right?). This is a bit hard for a grammar, but I believe this was holding test262-parser-tests back from being merged into test262. What are your current thoughts on that question?

@dead-claudia
Author

@jugglinmike

Thanks for the detailed report, Isiah! I'm reading some frustration in the tone of your prose. That may be a fabrication on my end, but I can appreciate if you're feeling discouraged. Test262 wasn't created with parsers in mind, and you've had to put in a lot of extra work in order to benefit from those extra 300 tests.

You did catch on to my frustration correctly. I tried to highlight it while still keeping it civil and polite.

This could help, but it is a very long-term solution. [...] We could take a hybrid approach by exposing two versions of the tests in questions (one using do expressions for the benefit of parsers, and one using eval for the benefit of more conservative implementations), but even this will have to wait a while.

I'm fully aware it's a very long-term solution unlikely to be implemented soon beyond Babylon and friends. It's partially why I said "assuming they still return completion values" - I personally recall suggesting another interpretation method (for loops specifically - pushing to and returning an array rather than just the completion value) in es-discuss, but it's a bit buried.

We're already on-board with you there. As noted above, I consider the tests you've identified as technical debt. Your script may come in handy here. I'm sure there are other things you would have rather spent those three days doing, but you might be able to wring out some more value by using it to update the tests. What do you say?

The script itself won't exactly be useful for dealing with that (it just links tests to a parser and a testing framework), but the auxiliary data would at least give a guide for how to update them. I doubt you'd be able to simply run an automated script to convert them, either - the examples are all missing their frontmatter as well as most of the associated data, and some of the replacements are non-functional one-liners rather than the fully fledged tests they replaced.

So I expect it'd take even longer to accomplish the rest. If someone else wants to, feel free, but I'm kind of burnt out on it. (I also decided to take a bit of a coding break afterwards. Between that and other things, it was well-needed.)

@jugglinmike
Contributor

@jugglinmike I appreciate your detailed response and all of your work on this
issue. I agree with you in not immediately taking up most of the initial
suggestions, and look forward to your continued work in adding more parser
tests and facilitating adding more syntax error tests. I'm actually fine with
leaving in complex, valid tests, but simpler tests could be a good addition
as well.

Aw thanks, Dan!

Previously, I thought there was an idea that, for each syntax error test,
we'd have to be able to identify the line of spec that banned it (is this
right?). This is a bit hard for a grammar, but I believe this was holding
test262-parser-tests back from being merged into test262. What are your
current thoughts on that question?

From my perspective, Test262 benefits immensely from a strong file organization. Being able to definitively identify a single "correct" location for any given test helps maintainers avoid duplication and locate holes in coverage. These are some of the biggest challenges in maintaining a project with over 30,000 tests. It's why I personally have not been motivated by the occasional complaint from contributors about the extra effort the practice requires.

It's unrealistic to expect a completely objective organization structure (particularly when it comes to test file names), but I think something (e.g. directory structure) is far better than nothing.

Unfortunately, when it comes to tests that specifically concern parsing, this policy applies unevenly:

  • "Positive" syntax tests can be organized in this way. However, because we also serve runtimes, these generally include additional code asserting semantics, and that code is irrelevant for parsers. I previously considered this acceptable "noise", but @bterlson and @michaelficarra suggested that it may actually degrade test quality for parsers.
  • "Negative" syntax tests for early errors can be organized in this way without much difficulty
  • "Negative" syntax tests for code that doesn't match the grammar are difficult to classify. Sometimes, a good location is fairly obvious. A negative syntax test for void; probably belongs in language/expressions/void. But this breaks down fairly quickly--where would you place a test for code like foo bar; or ]?

It's that last consideration that has made me most reluctant to incorporate the test material in the "test262-parser-tests" project. But it may be that I've lost the forest for the trees. Having written all this out, I'm forced to recognize that the "strong naming scheme" requirement is only in place because it serves an end: "avoid duplication and locate holes in coverage." One interesting thing about @bakkot's tooling is that it attempts to serve that same end in the absence of a meaningful naming scheme.

In light of that, maybe we should consider bringing in a subset of the "test262-parser-tests"--specifically the tests labeled "fail". We would likewise eschew any sort of meaningful naming convention (likely also omitting metadata for the same reasons) and place all files in some generic directory like test/language/violations or test/not-language. I would still insist on strong naming for "positive" syntax tests and tests for early errors, but this could be a way to improve Test262's usefulness without sacrificing maintainability.

In case it isn't clear: I'm just spit-balling at this point. There are certainly more details to work out (notably: if Test262 can/should maintain tooling in Python and JavaScript, and the task of reviewing all of those tests), but I would like to hear from you folks (plus @rwaldron and @leobalter) before going too far with this line of thinking.

@jugglinmike
Contributor

You did catch on to my frustration correctly. I tried to highlight it while
still keeping it civil and polite.

For what it's worth, I think you did that well :)

So I expect it'd take even longer to accomplish the rest. If someone else
wants to, feel free, but I'm kind of burnt out on it. (I also decided to take
a bit of a coding break afterwards. Between that and other things, it was
well-needed.)

Fair enough; we can file this as a "nice to have" effort, given that the current state is technically valid. I'll leave it to @leobalter to decide if a new "tracking" issue is in order.

@leobalter
Member

In light of that, maybe we should consider bringing in a subset of the "test262-parser-tests"--specifically the tests labeled "fail". We would likewise eschew any sort of meaningful naming convention (likely also omitting metadata for the same reasons) and place all files in some generic directory like test/language/violations or test/not-language. I would still insist on strong naming for "positive" syntax tests and tests for early errors, but this could be a way to improve Test262's usefulness without sacrificing maintainability.

I'm not against doing something like this and importing the parser tests fully or partially. It does leave me with a question, though: how would this merge come with a guarantee of maintenance?

The file naming scheme and the metadata help not only with identifying coverage of the spec text, but also with locating things in the project. If anything changes in the spec text, they become the tools we use to find and update the respective tests.

"Negative" syntax tests for code that doesn't match the grammar are difficult to classify. Sometimes, a good location is fairly obvious. A negative syntax test for void; probably belongs in language/expressions/void. But this breaks down fairly quickly--where would you place a test for code like foo bar; or ]?

Having tests for non-language code is a case where every change in the grammar - not only breaking changes, but even new language features - will demand a search through non-specific syntax tests. Take any random thing; it could be types: String function foo() { return "1"; }. This is not valid syntax today, but if it ever became valid in the future, I would have to search through random files for anything similar. If we had some sort of tooling to deal with this, AST-based or not, I'd be more compelled.

if Test262 can/should maintain tooling in Python and JavaScript

I'm fine with maintaining tooling, especially in JS, as long as it is not required for consuming the tests but is useful for creating new files or maintaining existing ones.


@isiahmeadows:

A ton of tests are using eval unnecessarily

You're not wrong to flag the need for improvements here. Most of these files are legacy ones we really want to update, and consuming Test262 from parsers makes this fix even more appealing.

This was for a pretty simple script I wrote, but the workaround took about 3 days' worth of work, so I'm obviously not happy with how this turned out...

After 2+ years of work on this project, I have to say my list of things I'd like to improve is still growing, and I spend most of my non-work time on different sorts of scripts for it. I appreciate your dedication here.


What would I like to see happen?

  1. Don't use eval unless you have to.

+1

  2. List eval in features in the frontmatter whenever you use it.

+1. If we solve no. 1 first, this could be re-evaluated, but I wouldn't mind flagging tests, as this could be really helpful for tests using indirect eval calls.

I'll check what I can do to find any cases with direct and indirect eval calls.

  3. For each test that checks for static semantics (like the exceptions I noted above), if you must use eval, provide both an eval-based version and a static fallback.

If the spec can be verified without eval, the static version should never be a fallback, but the main approach. Eval should only come in where it's explicitly necessary; for other cases, a test runner could simply wrap any content in it.
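As a sketch of that wrapping idea (hypothetical runner code, not anything in Test262's harness):

```js
// Hypothetical: a runtime-oriented consumer can recover eval-based behaviour
// from a static test by wrapping the file's source itself.
function runWrappedInEval(testSource) {
  try {
    (0, eval)(testSource); // indirect eval evaluates in the global scope
    return { threw: false };
  } catch (error) {
    return { threw: true, error: error };
  }
}
```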

  4. As do expressions appear, assuming they still return completion values, start substituting those for equivalent eval versions as applicable. Parsers will understand the former, but they can't process the latter.

This is a matter we should discuss only after do expressions are well implemented in the projects consuming Test262, to avoid forcing the adoption of a new feature just to test other features, old or new.


I'll try to work on item 1 first, which involves reviewing the files already listed. I'd also be very thankful to anyone willing to contribute; please feel free to engage with this work, as my availability is no more than 1 day per week for the next 2 months.

@dead-claudia
Author

@leobalter

If the spec can be verified without eval, the static version should never be a fallback, but the main approach. Eval should only come in where it's explicitly necessary; for other cases, a test runner could simply wrap any content in it.

This is covered by my first point pretty succinctly: "Don't use eval unless you have to". There are certain tests that require it, particularly those that test completion values of statements, but those are incredibly rare from what I observed. The few exceptions I listed in the original comment were about 50% of the tests I found where eval was truly necessary. (I think it was about half a dozen or so.)
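For illustration of why those few tests genuinely need it: a statement's completion value is only observable through eval today (or, eventually, do expressions). For example, using Test262's assert.sameValue:

```js
// Illustrative only: the completion value of a statement is only observable via eval.
assert.sameValue(eval("if (true) 42;"), 42);
assert.sameValue(eval("var x = 1;"), undefined);
```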

This is a matter we should discuss only after do expressions are well implemented in the projects consuming Test262, to avoid forcing the adoption of a new feature just to test other features, old or new.

Agreed, and it was more of a forward-thinking-style thing rather than a mandatory "do it now". Note the bolded part in what I told @jugglinmike in a reply - I'm fully aware of its status:

I'm fully aware it's a very long-term solution unlikely to be implemented soon beyond Babylon and friends. It's partially why I said "assuming they still return completion values" - I personally recall suggesting another interpretation method [...] in es-discuss, but it's a bit buried.

(I've been meaning to file that as a potential issue in the do expression repo, but just haven't gotten around to it.)

@leobalter
Member

leobalter commented Nov 28, 2017 via email

@dead-claudia
Author

@rwaldron I presume this was closed by mistake? (based on reading the relevant PR)

@jugglinmike
Contributor

This issue was closed prematurely, but I think I've finished refactoring all the unnecessary uses of eval. This gave me an excuse to assign more meaningful names to the test files. That makes browsing the directories a little easier and it also helped identify gaps in coverage.

I may have missed a few, though I definitely skipped some intentionally. Some turned out to be necessary (particularly those concerning statement completion values). If anyone finds more, please let me know.

Here's hoping Test262 is a little more friendly for parsers!

@jugglinmike
Contributor

Also, thanks to @leobalter and @rwaldron for reviewing the patches :)
