
Less self-writing tests, more generated tests? #48

Open
domenic opened this issue Dec 4, 2013 · 10 comments

@domenic
Member

domenic commented Dec 4, 2013

A lot of the tests, especially for 2.3, are created programmatically via various permutations of the objects and possibilities involved. This has always been problematic, and recently @jcoglan ran into it and discussed it on Twitter.

It might be nicer if we generated test cases that were readable by themselves, somehow. Or just wrote them all out manually, I dunno.

This is probably necessary for adaptation to test-262 as well; I can't imagine the test harness there is nearly as flexible as Mocha, from what I've seen.

@domenic
Member Author

domenic commented Dec 4, 2013

An easier approach might be to create a smaller suite of "representative tests" that are fully written out, such that you are pretty likely to fail the representative test for a given section if you fail the generated tests. Then people could use the representative tests as debugging aids. Sometimes they might miss a subtlety, and pass the representative test but fail the relevant generated ones, in which case they have to do the extra legwork of digging through our levels of indirection. But ideally, the representative tests alone should be enough to guide you toward a correct implementation.

@jcoglan

jcoglan commented Dec 4, 2013

That's a good point: it might not be necessary to test every combination of factors; you could use pair-wise testing instead. Start with one 'canonical' or 'happy path' test, then have one test that varies each factor in turn. E.g. if you're testing a form validator, you check that it passes with a valid set of inputs, then write a test for each input that sets it to something invalid and checks that the whole form is invalid as a result.

I know I'm probably over-simplifying, but I've used this approach a few times and it tends to produce good-enough code with massively reduced testing costs.
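A rough sketch of that shape, using a made-up isValid form validator and field names purely for illustration (nothing here is from this repo):

```js
const assert = require("assert");

// Hypothetical validator under test.
function isValid(form) {
  return typeof form.name === "string" && form.name.length > 0 &&
         /^\S+@\S+$/.test(form.email) &&
         Number.isInteger(form.age) && form.age >= 0;
}

// One canonical "happy path" case that should pass.
const validForm = { name: "Alice", email: "alice@example.com", age: 30 };
assert.ok(isValid(validForm), "canonical valid form should pass");

// Then one test per factor: vary just that field to something invalid
// and check that the whole form is rejected as a result.
const invalidVariations = { name: "", email: "not-an-email", age: 1.5 };
for (const field of Object.keys(invalidVariations)) {
  const form = Object.assign({}, validForm, { [field]: invalidVariations[field] });
  assert.ok(!isValid(form), "form with invalid " + field + " should fail");
}
```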

Also, thanks for listening despite my snark <3

@briancavalier
Member

To @jcoglan's point, there are a few generative testing tools for JS, like Crockford's JSCheck and gent. Typically they deal with inputs that are easy to generate, like numbers, strings, etc., and they explore the input space (usually randomly) for some number of iterations. Not sure how easy it would be to use such a thing for these tests, but when I've used them, they've drastically cut down on the sheer volume of manual test writing. Sometimes it takes longer to devise how to generate the inputs, though, like when you need more custom things such as strings in a specific format, etc.
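For a sense of the pattern, here is a hand-rolled version (not JSCheck's or gent's actual API) against a made-up add function; the real tools wrap the generators and the iteration loop for you:

```js
const assert = require("assert");

// Hypothetical function under test.
function add(a, b) { return a + b; }

// A tiny generator for one easy-to-generate kind of input.
function randomInt(max) { return Math.floor(Math.random() * max); }

// Explore the input space randomly for some number of iterations,
// asserting a property that should hold for every generated case.
for (let i = 0; i < 1000; i++) {
  const a = randomInt(1000), b = randomInt(1000);
  assert.strictEqual(add(a, b), add(b, a),
    "add should be commutative for " + a + ", " + b);
}
```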

Anyway, it could be a good direction, but seems challenging for promises :) It'd sure be interesting to try it, though.

@wizardwerdna

There are two clear styles for programmatic testing: the modern functional style, which is to programmatically test as large a survey of the problem space as is practical, and the modern TDD style, which is to survey only as much of the space as is necessary to drive out the code by triangulation.

Each style has its benefits and failings, and there is a broad range of styles in between the two. The present test suite, particularly in its current incarnation, is of the former style, which makes a great deal of sense for the purposes of a standards-compliance verification scaffold.

That said, it does have failings worthy of note:

  1. When something goes wrong, EVERYTHING goes wrong. The first pass through the tests typically generates kazillions of error messages. Usually the problem at the beginning is the adapter, when everything goes kablooey. But then a typical step will yield lots of messages anyway. This is particularly true of the nuanced 1.1 stuff, which is admirably comprehensive, but has many, many tests covering those new issues.

  2. The next step is to try to figure out just what is going on. Now the issue is how to parse things out: what's wrong, and what should I do first? It is actually terrific that the tests latch onto quoted portions of the standard. That said, it is difficult to figure out how bad things are. It's not a bad idea to start developing a "strategy guide" for those beginning, explaining how to approach the process of confirming conformance.

My experience with bringing a 1.0-compliant implementation like covenant up to the current spec was like others' here. I almost wanted to give up, before realizing that the initial total failure had to do with the adapter change. Fixing that reduced the problems by 50%, and I almost wanted to give up again, before realizing that the test coverage of 1.1 issues is about 1000 times deeper than for 1.0. Then I noticed that a great deal of the test results addressed the same issue, here and there. I fixed one or two clear limitations and reduced the problems by another 50%, and I almost wanted to give up, before realizing that a kazillion errors meant maybe two or three "features" that needed addressing.

Then I realized that I had only spent a bit of time doing the work that needed to be done, and only had a kazillion errors left.

So the first thing I did was add one or two tests to my TDD suite, getting red results. I turned off the depressing Promises/A+ testing and drove out the code to make them green. Then back to the big suite.

After only a few passes, I was reduced to a handful of test categories, where I ended up spending most of my time until everything was green.

In other words, I used the suite to generate some noise, parsed the noise, wrote some tests and drove them out in my TDD world, and got some more noise. It really wasn't that bad an experience in retrospect (at the time, I nearly threw a few laptops out the window).

It is a terrible mistake to use the suite for TDD-style development. This suite is not for unit testing; it is for validation. At the end of the day, it makes a lot of noise for the subtlest of errors, which is why it's a good validation suite and a TERRIBLE vehicle for driving out conformant code.

Could the experience be made better? Absolutely. A breakout of "big picture" and "happy path" testing would be useful to folks starting out, in that it gives them the warm fuzzy of knowing they are close, even though the devil is in the details. But I'm not sure how much of the original pain I felt would truly be alleviated. I would support the reorganization, so long as the present practice of targeting all tests to the corresponding portions of the standard is not lost. That said, we should focus on the ease of maintaining the suite as much as (or more than) the ease of using it.

I believe that thinking about the suite as a functional-test-style suite rather than a TDD suite is the proper focus. I am wondering if the best way to develop an ease-of-use tool is to actually start out by specifying an adapter and driving out, via TDD, a suite of promise tests. That might yield a nice minimal set for these purposes. It would be duplicative, but the tests would more naturally follow the style folks are wishing for.

I am looking at the angular.js $q library and thinking about what it will take to make it 1.1-compliant and more flyweight (it uses the module pattern, with a whopping 900-byte-per-promise footprint). And as before, I am ready to gouge my eyes out...


@jcoglan

jcoglan commented Dec 5, 2013

@wizardwerdna This chimes entirely with my experience. From an impl that passed the 1.3.x tests, I started out passing < 150 of the 2.0.x tests. Replacing fulfill with resolve in the adapter, changing my code so I only access result.then once (which I still don't understand the need for), and making all callback execution async has got me down to ~400 failures, but I'm mired in 2.3.3's generated tests of interop with thenables, which is really confusing.
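For anyone else puzzled by the result.then-only-once point: my understanding is that the resolution procedure (2.3.3.1) has you retrieve the then property a single time and reuse that reference, because a thenable can expose then as an accessor whose value changes between reads. A rough sketch of the kind of hostile thenable involved (the names are made up for illustration):

```js
// A hostile thenable: every read of `then` returns a different function.
// An implementation that reads x.then more than once can end up calling
// two different functions, or see the property change under it.
let retrievals = 0;
const evilThenable = {
  get then() {
    retrievals += 1;
    return function (resolvePromise, rejectPromise) {
      resolvePromise("value from retrieval #" + retrievals);
    };
  }
};

// A conforming resolution procedure stores one reference and reuses it:
const then = evilThenable.then;          // single retrieval
if (typeof then === "function") {
  then.call(
    evilThenable,
    function (value)  { console.log("fulfilled with", value); },
    function (reason) { console.log("rejected with", reason); }
  );
}
console.log("then was retrieved " + retrievals + " time(s)"); // 1
```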

It's certainly a case of each missing feature causing 100s of test failures, and it's hard to narrow down which tests to run to drive out your implementation. I had a much better experience with 1.x.

@wizardwerdna

If you are a TDD-er, my advice would be to ignore the suite for the moment, and go about implementing the differences between the 1.0 and 1.1 standards. Write tests for the key differences, and make sure you are passing them. Then run the suite again and see if it inspires a new failing test; do that, and so forth. At the outset, I would not look for anything but improvements in the total numbers passed until you get down to a manageable amount.

One piece of guidance is that, for the assimilation portions of [[Resolve]], the suite is enormously sneaky, sending in thenables designed to kill you. It's more like a security suite than anything else. You need to ENSURE that only one callback is called, and only once. Not by glancing at your code, but by assuming that they were sent in with extreme prejudice.

How you do that is up to you. I actually wrote a wrapper, "once", with a closure over a boolean, and passed my callbacks to untrusted promises wrapped in it. once(f) will call f when applied, but afterwards once(anything) will return undefined without calling the callback. I write this not to propose how you would solve the problem, but to give you a sense that small phrases in the spec are rigorously tested in a way they were not beforehand.
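A minimal sketch of that kind of guard, with the shared boolean in a closure (an illustration of the idea, not covenant's actual code):

```js
// Guard a pair of callbacks so that, between them, only one is ever
// invoked, and only once. The boolean is shared through the closure.
function makeOnce() {
  let called = false;
  return function once(f) {
    return function () {
      if (called) return undefined;   // later calls are silently ignored
      called = true;
      return f.apply(null, arguments);
    };
  };
}

// Usage when handing callbacks to an untrusted thenable, e.g. inside a
// resolution procedure:
//   const once = makeOnce();
//   then.call(thenable, once(resolvePromise), once(rejectPromise));
```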

And those of you who wrote those killer tests, you are very sneaky, nasty people. Thank god for you.


@domenic
Member Author

domenic commented Dec 5, 2013

> One piece of guidance is that, for the assimilation portions of [[Resolve]], the suite is enormously sneaky, sending in thenables designed to kill you. It's more like a security suite than anything else. You need to ENSURE that only one callback is called, and only once. Not by glancing at your code, but by assuming that they were sent in with extreme prejudice.

This makes me smile :D

@stefanpenner

@domenic I am also glad, as some (or lots) of those scenarios are derived from real breakage in real apps. I am glad that the effort of the few can benefit the masses.

@wizardwerdna

Indeed. These tests are b@#2-breakers, digging deep into the nuances of promises. They reveal how the subtlest tradeoffs can destroy all but the most vigilant code. Yes, they could be designed to be clearer in diagnosing these awful weaknesses in our system -- but thank god they are there to let us know where we are awfully weak.


@stefanpenner

I found that using Mocha's grep feature to isolate slabs of tests has made it easier to debug issues and update the implementation I maintain when the spec changes.
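For example (assuming the suite's spec files are being run directly through Mocha; the path here is illustrative):

```sh
# Run only the tests whose descriptions mention section 2.3.3
mocha test/ --grep "2.3.3"
```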
