Atomate v2: high-level unit testing strategy #289

Open · computron opened this issue May 30, 2019 · 17 comments
Labels: improvement

Comments

@computron (Contributor)

For atomate v2, we all agreed that we want to ditch the current strategy of having "fake" VASP runs in a sort of pseudo-integration test. The tests are hard to write, hard to debug, and bloat the repo.

Moving forward, we propose a two-tier testing strategy:

  • Tier 1: Atomate contains unit tests for anything that requires one. Any new code or function that is written generally requires a unit test, but code that simply connects together existing functions does not. Example 1: you write a custom FireTask that copies VASP files according to some rules - this requires a unit test since it is new functionality. Example 2: you write a FireTask that simply loads an existing VaspInputSet and writes VASP input files - this does not require a unit test since it just calls existing functions and classes that are already tested elsewhere, e.g. in pymatgen. Example 3: you write a Firework that just connects together a bunch of Firetasks - this does not require a unit test since the Firework itself is trivial code built on other code that is already tested (the Firetasks). Example 4: you write a powerup that modifies some aspect of a workflow - this requires a unit test to make sure that the desired modification occurred after running the powerup (a minimal sketch of such a test follows this list).

  • Tier 2: Each workflow in atomate must contain an integration test that runs "real" VASP. These tests would need to be run periodically and on one or more machines capable of running VASP. Some kind of system would need to be in place to make sure these tests get run. A side benefit is that they will provide a real working example of how to run a workflow of that type.
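To make Example 4 above concrete, here is a minimal sketch of what such a powerup unit test could look like. The powerup add_debug_tag and the _debug spec key are hypothetical, invented purely for illustration; only Firework, Workflow, and ScriptTask come from FireWorks itself.

import unittest

from fireworks import Firework, ScriptTask, Workflow


def add_debug_tag(original_wf):
    """Hypothetical powerup: tag every Firework in the workflow for debugging."""
    for fw in original_wf.fws:
        fw.spec["_debug"] = True
    return original_wf


class TestAddDebugTag(unittest.TestCase):
    def test_modification_applied(self):
        # build a trivial two-Firework workflow
        fw1 = Firework(ScriptTask.from_str("echo 'step 1'"), name="fw1")
        fw2 = Firework(ScriptTask.from_str("echo 'step 2'"), name="fw2", parents=[fw1])
        wf = Workflow([fw1, fw2])

        wf = add_debug_tag(wf)

        # the desired modification must be present on every Firework after the powerup runs
        for fw in wf.fws:
            self.assertTrue(fw.spec.get("_debug"))


if __name__ == "__main__":
    unittest.main()

Because the test only inspects the in-memory Workflow object, it needs no database and no VASP, which is what makes this kind of unit test cheap to run on CI.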

So essentially:

  • more unit-style tests
  • ditch "fake" VASP integration tests
  • add real VASP integration tests

Comments on general strategy?

@mkhorton (Contributor)

I am fully on board with this, strategy and examples both.

Doing integration tests with real VASP is the only way we'll spot issues due to compilation errors or regressions in future VASP versions. I think having real VASP integration tests is a necessary requirement to ensure the quality of MP data in the future.

The only question for me is how we test builders or ToDb tasks. But I think the builders are perhaps a separate discussion.

@shyuep (Collaborator) commented May 30, 2019

I would say that there are three levels. Tier 1 is as defined by Anubhav. Tier 2 should be “fake” VASP. The idea here is to ensure that the logic of the code is correct and would always pass given a constant VASP version. Tier 3 is testing with actual VASP to ensure that VASP v20000 gives the same results with the same parameters as VASP v5.x.

@shyuep (Collaborator) commented May 30, 2019

I would exclude Tier 3 from CI testing.

@mkhorton (Contributor)

I think fake VASP adds a large engineering burden for minimal reward; most of what it would catch would also be caught by actual VASP integration tests. The important thing, I think, is to write the integration tests in a way that is sensible and robust to unimportant changes in output.

@bocklund (Contributor)

@computron Totally agree.

@mkhorton I think testing the API on a set of example files or an example database built as part of the test suite is reasonable for builders. ToDb tasks could be tested by returning a dict if no DB file is specified. We could just check for key properties of interest or properties that are guaranteed to be pulled in.

@shyuep yeah, in the discussions we've been having at the weekly meetings we decided that your Tier 3 basically cannot be run on CI because of the VASP install and the time required to run all workflows, even for simple cases, within the time window of the CI system (e.g. 50 mins on Travis). We have been discussing that your Tier 2 is what we have now. The tests are hard to write and easy to break, and the idea is that they don't actually test that much (except for things being wired up correctly, which probably won't break). Better coverage of the low-level stuff should make your Tier 2 tests redundant, because your Tier 3 tests make sure everything is wired up correctly.

pytest

Can we switch to pytest for all new tests? It's so much more convenient for writing tests, and it's backwards compatible with unittest, so we don't have to change any old code (though it might be nice for maintainability's sake to transition during any refactoring).

A key benefit of pytest is that we can include the integration tests in this repo and mark them as integration tests. Running the tests for real, with all the integration tests, would then use the same script as regular CI, except that we pass a flag to pytest to enable the integration tests (easier to maintain).
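To make this concrete, here is a minimal sketch of how the pytest marker approach could work. The marker name "integration" and the test/file names are illustrative, not existing atomate code; the pytest hooks and flags themselves are standard pytest features.

# conftest.py - register the custom marker so pytest does not warn about it
def pytest_configure(config):
    config.addinivalue_line("markers", "integration: tests that launch real VASP")


# test_some_wf.py - hypothetical test module
import pytest


def test_incar_settings():
    # ordinary unit test: always runs on CI
    ...


@pytest.mark.integration
def test_full_workflow_with_real_vasp():
    # full integration test: only runs when explicitly selected
    ...


# Regular CI run, skipping the integration tests:   pytest -m "not integration"
# Periodic run on a machine with VASP available:    pytest -m integration

The same test files are used in both cases; only the pytest invocation changes.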

@mkhorton (Contributor)

nose is deprecated anyway right?

@bocklund (Contributor)

Yes, it is not supported anymore

@shyuep (Collaborator) commented May 30, 2019

@mkhorton I disagree that "fake" VASP adds engineering burden. I would say that a basic integration test would be running a full Firework, but with "fake" output files being put into the directory. Given that the Tier 1 tests do not test integration, the fake VASP tests minimally ensure that if you have a FW that does an OptimizeTask, moves some files, then a StaticTask, all three joined components still work when they are combined. I strongly disagree with the notion that because FWs are "trivially" stringing FireTasks together, this trivial code does not need to be tested. Far too often, we assume something is trivial (e.g., the existence of a certain output file or that files are moved to the correct location) when it is not. The fact that LogicA and LogicB are sound does not mean that LogicA+B is sound.

There is of course no need to go overboard and do a massively overengineered solution with different randomized files or output.

@computron (Contributor, Author)

@shyuep We initially had the "intermediate" tier of fake VASP tests for all the reasons you mention here. I agree completely that testing connections is important and is often the most important test for these workflows (note that these tests are generally most useful for testing connections between Fireworks, rather than between Firetasks). It is also nice to have connection tests that are independent of VASP (unlike the full running-VASP integration test). Finally, another reason is that such a tier can at least run on CircleCI, as opposed to actually running VASP, which needs some kind of separate testing framework; that makes these tests more automatic. Several people also did work to try to make writing such tests easier, like adding a "use_fake_vasp" powerup and creating a new unit-test base class to inherit from for these fake tests.

Yet, in the past couple of years, it's become clear that this is just not working out in practice. Whether or not the idea of a fake VASP test is conceptually simple to you, people struggle with it and delay writing these tests. It's just another set of things for them to learn - i.e., not only how to write a "real" VASP workflow (which they can often barely do), but then afterward also how to write a proper "fake" VASP test (which they delay and delay, and which requires a lot more code and file commits). People are just not good at writing this code. In contrast, I think people will be open to having a full integration test for all workflows, since the usage and point of that is clear to everyone.

Here is also a specific example of what I think will happen if we keep around "fake" VASP tests and also add the "real" VASP test as requirements:

  • Person X writes a workflow, submits a PR without any tests (common)
  • I ask them for all 3 types of tests before merging
  • Person X begrudgingly adds some unit tests and, after more prodding, a full "real VASP" integration test. Then they beg me to accept the PR so they can get on with their life, since the workflow is clearly working / operational (it passes the "real" VASP test) and they don't have time to also get a working "fake" VASP test in place
  • If I stay adamant and tell them "no fake VASP test, no merge", then the code will simply languish. Person X will tell their advisor they finished the atomate workflow as required, but that there is just one more test they need to write before it's officially merged - and they will convince their advisor that other things (like that paper that needs to be written) are more important for the upcoming week than finishing this last test. This state of affairs will continue until Person X graduates. And if I merge the workflow before the "fake VASP" test is in place (e.g., because we need it for MP or Person X is graduating, etc.), then that opens the door for everyone else to skip writing such a test too.

Essentially, I agree with @mkhorton that people consider "fake VASP" tests to be burdensome, and it has really slowed down development. Some kind of integration test is definitely essential, however. I think the unit test + "real VASP" strategy is the best way forward and will allow us to go faster. If we really get hit by a lot of cases where having a fake VASP test would have avoided a problem, then we either need to add those tests back, or we need to see what's going on in our "real VASP" strategy that is failing to capture those cases.

@shyuep (Collaborator) commented May 30, 2019

@computron RealVASP is no easier to write than FakeVASP. And I would say running RealVASP violates a critical principle of testing, which is that you are responsible for code that you manage, not code that is written by someone else. RealVASP tests not just atomate's logic, but also VASP's coding, which is entirely up to the powers that be at VASP. It would also mean that if you run a test using VASP 5.1 and I run the same test using VASP 5.2, there is a real possibility that a test would succeed for you and fail for me. That means essentially that atomate is no longer just dependent on Python and PMG versions (and FireWorks and other dependencies), but also on the VASP version.

My suggestion would be that you insist on someone providing:
i. FireTask unit tests.
ii. An integration test with a set of pre-computed outputs that a so-called FakeVASP can use to simply "simulate" a run. This also tells me that the developer has actually done such calculations, and I can trust his/her outputs to a certain degree.
iii. RealVASP, which does not actually require additional code from the developer. You simply run the exact same integration tests with a command-line/environment flag specifying that real VASP should be used to generate the output instead of the simulated copy operation, and all the tests remain the same! In other words, this flag would disable the so-called use_fake_vasp powerup. This can easily be done by having use_fake_vasp check for an environment variable "USE_REAL_VASP" and do nothing if that variable exists. That said, I don't think the powerup approach is very user-friendly (see later point).

Ultimately, testing is about reading in the output files (whether they come from fake or real VASP) and checking that the outputs make sense (with assert statements). If the so-called RealVASP tests amount to no more than running the workflow and checking that it completed, that is not testing.

Maybe my simpler question to you is: what is it about real VASP that you think would make it easier to write a test for than fake VASP? A person who finds FakeVASP difficult is never going to find RealVASP tests more palatable, especially if your RealVASP tests are only run once a day (or week) and the developer has to wait for real VASP to complete (which minimally takes hours) before knowing whether the test has passed.

That said, I do think FakeVASP can be made a lot easier if it is integrated better. There is no reason to use a powerup. You can simply provide a command-line script named "vasp" that simulates VASP-like behavior. That way, all the fireworks remain the same.

@shyuep (Collaborator) commented May 31, 2019

Let me perhaps sketch out why I think FakeVASP is perfectly doable and readily substitutable with RealVASP.

Let's start by defining a helper testing class:

import os
import unittest


class AtomateWFTest(unittest.TestCase):

    VASP_OUTPUT_DIR = None  # subclasses point this at their pre-computed outputs

    @classmethod
    def setUpClass(cls):
        if not os.environ.get("USE_REAL_VASP") and cls.VASP_OUTPUT_DIR is not None:
            # tell the fake "vasp" script where to copy outputs from
            os.environ["VASP_OUTPUT"] = cls.VASP_OUTPUT_DIR
            # put the fake vasp script first on PATH (alternatively, this can simply be set as part of CircleCI)
            os.environ["PATH"] = "/path/to/fake/vasp/:" + os.environ["PATH"]

An actual WF test will look something like this.

class WFTest(AtomateWFTest):
    VASP_OUTPUT_DIR = "example_output"

    def test_run_workflow(self):
        # <run the workflow; somewhere inside, a call to "vasp" will be made>
        # <then assert some checks of the output, e.g. self.assertAlmostEqual(...)>
        pass

You then have a script named vasp somewhere in the atomate repo (similar to pymatgen command line dependencies). That script will look something like this (the logic can be a bit more complex of course):

#!/bin/bash
# fake "vasp": copy the pre-computed outputs into the current run directory
cp -r "$VASP_OUTPUT"/* .

In this way, developers just need to concern themselves with defining the sample VASP output location and the specific things the workflow needs to test for. When you want to do the real VASP tests, you simply set USE_REAL_VASP in the environment and make sure the real vasp executable is on the PATH.

The above will only work with serial testing of course, but as far as I know, atomate does not do parallel testing.

@mkhorton (Contributor)

Hmm, I may be coming around to the idea of fake VASP. I think perhaps my issue is more with where the fake VASP tests live in the repo/CI, and how they are integrated.

However, to respond to this point:

And I would say running RealVASP violates a critical principle of testing, which is that you are responsible for code that you manage, not code that is written by someone else.

I think it's essential we test real VASP. In general, I completely agree that we do not test code we don't manage ourselves. However, for this specific use case, I think if we don't have integration tests that involve actually running VASP, then atomate CI can give us a false sense of security. If we're running 100,000 calculations for Materials Project but something has changed in VASP and these are bad calculations, we need to know that and be able to detect it. Pragmatically, I think this just has to be a part of how we approach testing.

I do think there may be a way we can incorporate fake and real VASP into the same integration testing framework, however, so that we can write a single set of tests. And this could live somewhere outside the core atomate repo.

I will take some time to read the comments above and think some more before responding further.

@shyuep (Collaborator) commented May 31, 2019

@mkhorton Just a note that I am completely on board with REAL VASP tests. Given the use of atomate as the backbone for MP, it is, of course, critical that when MP updates VASP, the results do not change. I am only objecting to it being the only integration test. Also, if tests pass on CI but fail on the real VASP test machine, we immediately know that real VASP changes are likely the issue.

@bocklund (Contributor) commented May 31, 2019

Isn't what @shyuep proposed basically how it works now? Except that fake VASP is a powerup (currently) vs. an environment variable (here). I see some validity in the idea of a fake VASP in that you want to see that the workflow graphs are executed as expected, but I don't think there's a maintainable way to check "results" in the way suggested or in the way we are doing it now.

@mkhorton I agree. @shyuep mentioned

RealVASP tests not just atomate's logic, but also VASP's coding, which is entirely up to the powers that be at VASP. It would also mean that if you run a test using VASP 5.1 and I run the same test using VASP 5.2, there is a real possibility that a test would succeed for you and fail for me. That means essentially that atomate is no longer just dependent on Python and PMG versions (and FireWorks and other dependencies), but also on the VASP version

As we discussed during the atomate meeting a week or two ago: we actually do want to test these things. The integration tests should prove and give confidence that atomate gives you scientifically correct results consistently over time. Ultimately, things that break in the integration tests might actually mean that a fix is needed in a dependency, but it's within the scope of atomate to promise that results will be stable across dependency versions. The regularly run unit tests should give confidence that everything runs as intended.

@JosephMontoya-TRI commented May 31, 2019

I like the idea of keeping fake VASP, but I'd like to lighten the contract relative to what we do now. We have "inputs" and "outputs" for every VASP firework that's run, which seems unnecessary. It's not particularly painful to generate the data, but it can be painful to organize all of it. It would be great if there were a way to automate the generation/organization of test data in the prescribed directory structure.

@shyuep (Collaborator) commented Jun 5, 2019

@bocklund My final note on this subject - the whole "we hate fake VASP tests" position is premised on the false idea that somehow real VASP tests are "easier" for the developer than fake VASP tests. All I am pointing out is that (a) it is not easier - if you have to write real VASP tests, those can easily be reused in fake VASP environments, and (b) fake VASP runs a lot faster than real VASP, and for a developer who is just tweaking code, I would like the answer to "did I break anything?" to come in O(minutes) rather than O(hours/days/weeks).

Also, let me paint the picture of what is likely to happen. Some group of people is going to spend an inordinate amount of time setting up the real VASP test machine (since it can't be done on CircleCI). I would say weeks to get it working as you want, and given MP's list of priorities, we will see how that goes. They compile VASP v5.4.current on that fancy test machine. All tests run on VASP v5.4.current and they go on their merry way. Five years down the road, every single person is busy coding new Science-paper feature X and has forgotten all about updating that v5.4.current, even though the real world has moved to VASP v10.0.

So while in an ideal nirvana I am on board with the idea that if atomate is tested on v5.4, the workflows are guaranteed to give a certain result with v5.4 - i.e., atomate essentially pegs itself to a certain VASP version like all other dependencies - I am not convinced that we are attacking a real problem that urgently needs a solution, given the amount of effort that will be required to make it happen. I would think something more obvious, like me moving from the PBE52 to the PBE54 pseudopotentials, is more likely to cause issues. There are literally a gazillion things we can peg dependencies to, and VASP version is very low on that list in my mind.

@computron (Contributor, Author)

Hi all

  1. The proposed way to have a "dual" integration test that can be run using either fake or real VASP is already how it currently works (and has since the beginning). In the current code, there is a VASP_CMD parameter at the top of the unit test. If this is set to None, the test does the fake VASP run. If it is set to a path to a VASP executable, it actually executes VASP but still performs the same checks as before. Although, looking at most of the tests that were written, perhaps I was the only one who actually wrote tests to respect this...

  2. Given that I think most of us would want all 3 types of tests if they were easy to write, I think the crucial question is what makes it difficult to have "fake" VASP tests. It's been a while since I wrote these, but here is my list:

  • running fake VASP requires a VASP "emulator", i.e., something that does the same thing as running VASP. The generic fake VASP emulator is pretty straightforward: it checks whether certain input tags in the INCAR are correct, and if so, it copies pre-determined VASP output files to the current directory (a minimal illustrative sketch appears after this list). The code for this is also fairly clean. But for the NEB VASP runs, apparently a custom VASP emulator is needed (probably because the directory structure for NEB runs is different). This adds some programming overhead for that team, and I also find the fake NEB VASP to be somewhat less clear (e.g., there is a _get_params() function whose purpose I don't really understand). For the most part, though, this doesn't seem like a huge problem - just a pain from time to time for certain workflows.

  • you need to give the fake VASP emulator a list of directories where the "pre-determined" input and output files are located. While it might take a little time and knowledge to set this up, it really doesn't look that bad to me. It does, however, look like a pain when there are a ton of runs in a workflow - it's just a lot of directories and files to manage. I'm not sure if there is a way to provide fake inputs/outputs for only a few crucial runs in the workflow instead of every single VASP calc.

  • almost non-existent documentation on how to write an atomate test. If there were clear, step-by-step docs explaining both the philosophy and procedure of fake VASP tests, these would be more straightforward to write.

  3. Many of the issues in writing atomate tests are really independent of fake VASP. Fake VASP is maybe just an added layer of complexity - one more straw on the camel's back that makes it "one more issue than I feel like dealing with". Some of the difficulties in writing VASP tests that are independent of fake VASP are:
  • how to set up and connect to test databases, both for FireWorks and for VASP tasks. This requires configuring the unit tests for the test system
  • currently, after parsing a VASP output, a Firework can either put that output in a database (tasks collection) or write it to a file (db.json). Testing for both of these possibilities is somewhat of a pain.
  • more fundamentally - in terms of making sure the INCAR is correct, knowing which INCAR parameters should be checked is moderately hard. The same goes for checking the k-point mesh: if pymatgen changes upstream to give a denser k-point mesh, should our integration test fail? Should we be testing for a minimum k-point mesh or a specific mesh?
  • also more fundamentally - knowing which outputs to check and to what precision is hard. For example, in a band structure workflow one might want to check that the final band gap is correct. But due to VASP compilation changes, VASP version changes, or various other factors (e.g. small changes to pymatgen input sets), these can differ. Should we be testing the final band gap to 1 decimal place, 2 decimal places, 3 decimal places, etc.?
  4. Also, it is worth pointing out that we discussed in the past having a VASP test suite - e.g., a matrix of structures and workflows - and making sure that when updating the code infrastructure on MP, that matrix of calculations gives the same results as the previous set. For time reasons, this was never accomplished.
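For reference on the first bullet under item 2 above, here is a minimal sketch of the kind of generic fake VASP emulator described there: it checks a few INCAR tags and then copies pre-computed outputs into the run directory. The task name RunFakeVasp and its parameters are illustrative only, not atomate's actual implementation; Incar.from_file is real pymatgen API.

import os
import shutil

from fireworks import FiretaskBase, explicit_serialize
from pymatgen.io.vasp.inputs import Incar


@explicit_serialize
class RunFakeVasp(FiretaskBase):
    """Hypothetical fake-VASP FireTask: verify key INCAR tags, then copy
    pre-computed outputs into the current directory instead of running VASP."""

    required_params = ["ref_dir"]            # directory holding "inputs" and "outputs" subdirs
    optional_params = ["check_incar_tags"]   # e.g. ["NSW", "ISIF", "IBRION"]

    def run_task(self, fw_spec):
        ref_dir = self["ref_dir"]
        tags = self.get("check_incar_tags", [])

        # 1) sanity-check the INCAR that the workflow just wrote
        user_incar = Incar.from_file("INCAR")
        ref_incar = Incar.from_file(os.path.join(ref_dir, "inputs", "INCAR"))
        for tag in tags:
            assert user_incar.get(tag) == ref_incar.get(tag), "INCAR mismatch for " + tag

        # 2) "run" VASP by copying the pre-computed outputs into the current directory
        out_dir = os.path.join(ref_dir, "outputs")
        for fname in os.listdir(out_dir):
            shutil.copy(os.path.join(out_dir, fname), os.path.join(os.getcwd(), fname))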

Overall here are my major conclusions:

  • I don't see many fundamental difficulties with having a fake VASP test like we have now
  • The difficulties in writing atomate unit tests are more a sum of many small problems than any one big problem. It is just too many little things for people, especially newbies, to wrap their heads around.
  • I see documentation for developers, including clear step-by-step instructions on how to write atomate tests, as being a major help to the current situation.

So I would lean towards going with what @shyuep said. We will essentially stick with what we have today, with maybe some minor modifications and clean-ups. But the big change will be much better documentation on how to write atomate tests.

Thoughts?

@itsduowang added the improvement label and removed the enhancement label on Feb 8, 2022