
MAINT: Broaden coverage of accessibility tests with Playwright #1378

Closed
trallard opened this issue Jul 3, 2023 · 15 comments · Fixed by #1408
Labels: kind: maintenance (Improving maintainability and reducing technical debt), tag: accessibility (Issues related to accessibility issues or efforts)

Comments

@trallard (Collaborator) commented Jul 3, 2023

The current set of accessibility tests with Playwright and axe-core only checks the Kitchen Sink pages.

We need to extend these to cover the following:

Edit: these are skipped
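As a rough illustration of what broadening the scans might look like, here is a minimal sketch of a Playwright + axe-core check. This is not the repo's actual test code; the page list, port, and axe-core version are hypothetical.

```python
# Sketch only: scan extra pages with Playwright + axe-core.
# EXTRA_PAGES, the port, and the axe-core version are illustrative,
# not the repo's actual test matrix.

AXE_CDN = "https://cdnjs.cloudflare.com/ajax/libs/axe-core/4.8.2/axe.min.js"
EXTRA_PAGES = ["/examples/kitchen-sink/index.html", "/examples/pydata.html"]


def critical_violations(results: dict) -> list:
    """Keep only the violations axe-core rates as serious or critical."""
    return [
        v for v in results.get("violations", [])
        if v.get("impact") in ("serious", "critical")
    ]


def scan_page(url: str) -> dict:
    """Open a page, inject axe-core, and return the raw axe results."""
    # Imported lazily so critical_violations stays usable without
    # Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.add_script_tag(url=AXE_CDN)
        # page.evaluate awaits the promise returned by axe.run()
        results = page.evaluate("axe.run()")
        browser.close()
        return results


if __name__ == "__main__":
    for path in EXTRA_PAGES:
        report = scan_page("http://localhost:8000" + path)
        bad = critical_violations(report)
        assert not bad, f"{len(bad)} serious/critical a11y failures on {path}"
```

Filtering by impact keeps the suite actionable: moderate/minor findings can be triaged separately instead of failing CI.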

@trallard added the tag: accessibility and kind: maintenance labels on Jul 3, 2023
@trallard trallard self-assigned this Jul 3, 2023
@12rambau (Collaborator) commented Jul 4, 2023

For the execution libs, we don't have control over how JupyterLite is displayed, so I expect a11y failures.
Also, jupyter-sphinx uses the exact same CSS as myst-nb, so if one works, the other will as well.

@trallard (Collaborator, Author) commented Aug 8, 2023

Adding a note for visibility. @gabalafou will be working on this in the upcoming weeks.

@gabalafou (Collaborator)

Working on this issue has raised fresh questions in my head about accessibility testing and this repo. The following are some of my thoughts about using Lighthouse versus Playwright.

When I worked on #1260 - Accessibility test Kitchen Sink with Playwright, I went for Playwright because I was already familiar with using it for automated accessibility tests. It also has a Pytest plugin and a Python API, which I thought would make it more friendly for this repo.

However, now that I am revisiting these themes, I wonder if I was too quick to abandon Lighthouse.

In my mind the guiding questions should be:

  1. Do we need to write custom accessibility tests, or do we limit the scope of our accessibility testing to running Axe-core scans on a set of pages?
  2. How extensive are our reporting needs and are we willing to build what we need from scratch if necessary? (To date, I haven't found any great free and open source libraries for reporting/displaying Axe-core test results.)

I want to stress that Lighthouse's reporting is a big factor in my head. But I guess it depends on how we imagine the tests being used in the future. I don't think that our current Playwright + Axe accessibility test output is particularly friendly. Thinking ahead: after we have fixed all of the currently Axe-detectable accessibility issues, who is going to deal with future Axe failures? Will they be inclined to do so from the CI system's outputs, or will they find those outputs too overwhelming or difficult to bother understanding? Lighthouse has some very nice reporting features; for example, it will take a screenshot of the part of the page that fails a particular accessibility test.

Another potential Lighthouse advantage is that it bundles other kinds of testing that might be relevant to PyData Sphinx Theme, namely performance and best practices. That would benefit participants in this repo by reducing the number of different tools and technologies that need to be learned and maintained. The Lighthouse ecosystem also already has GitHub Actions you can use, so you can more easily integrate it with GitHub (such as displaying on a PR whether the PR broke the accessibility of a certain page), and you don't have to think about configuring the runner to make sure it has all the needed dependencies, appropriate caching, and so on (on the other hand, between Tania and myself, we already have a pretty good amount of knowledge on how to do this).

Lighthouse also provides a server (LHCI Server), so if we could get some cloud credits, we could run a Lighthouse CI server in the cloud and collect test results over time. With history, you can check whether performance or accessibility improved over time. More importantly, you can also easily discover if a change decreased performance or accessibility.

On the other hand, I don't think Lighthouse is very easy to customize. As far as I understand, when it comes to accessibility testing, all you can do is give Lighthouse a list of URLs and it will scan each of those URLs with the Axe-core library, and then collect and report the results. If we want more fine-grained control—for example, the ability to scan only parts of pages, or even just to configure which rules get run from the Axe-core library—then I think we have to roll our own.
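To illustrate that URL-list model: a Lighthouse CI run is typically driven by a small `lighthouserc.json`. This is only a sketch; the URL and score threshold here are hypothetical, not this repo's settings.

```json
{
  "ci": {
    "collect": {
      "url": ["http://localhost:8000/examples/kitchen-sink/"],
      "numberOfRuns": 1
    },
    "assert": {
      "assertions": {
        "categories:accessibility": ["error", { "minScore": 0.9 }]
      }
    }
  }
}
```

Note that the assertion operates on the aggregate category score, which is exactly the coarse-grained knob being discussed: there is no per-rule or per-page-region control at this level.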

So it's all the standard trade-offs between choosing something off-the-shelf versus building your own solution.

@gabalafou (Collaborator) commented Aug 14, 2023

To summarize, here are the pros and cons of using Lighthouse, as I see it.

Lighthouse pros:

  • Report UI (to display test results in a user-friendly way)
  • Tool consolidation (use Lighthouse for performance and accessibility testing)
  • Integration ecosystem (GitHub actions, for example)
  • (if we can run LHCI Server) History (compare test results over time)

Lighthouse cons:

  • Limited configurability (either impossible or very hard to configure the accessibility tests)

@drammock (Collaborator)

Personally I don't mind viewing CI logs to get the relevant failure information --- it's what we have to do for ordinary pytest failures too. Also of note is that I already added a non-a11y test that uses playwright (to confirm that version switcher JSON data was correctly loading and getting styled properly). This is not to say that I'm strongly leaning one way or the other --- I can see the benefits of Lighthouse's out-of-the-box solution and reporting features.

Would it be a nightmare to do both? Lighthouse for the "first pass" and custom Playwright tests for anything that we know that Lighthouse won't catch?

@trallard (Collaborator, Author) commented Aug 14, 2023

Thanks @gabalafou, thoughts below

> When I worked on #1260, I went for Playwright because I was already familiar with using it for automated accessibility tests. It also has a Pytest plugin and a Python API, which I thought would make it more friendly for this repo.

Yes, that is correct, but there is also #585, where there was already a proposal to use Playwright for testing (as there are several testing gaps), and more recently, Dan worked on adding some tests for the version dropdown.

> How extensive are our reporting needs and are we willing to build what we need from scratch if necessary? (To date, I haven't found any great free and open source libraries for reporting/displaying Axe-core test results.)

Reporting does need to be improved - this is one of the items I have been meaning to work on, but the many spinning plates are tricky sometimes.
But we need to work on this to make it easier to follow and more useful for contributors and maintainers; this is entirely within the scope of our grant, and I suppose we would need to assess, scope, and plan what needs to be done in this area. Still, there are significant improvements we can make without it being a considerable lift.

Considering that the tests' primary users/consumers are the maintainers of this theme (and not end-users) and that we already have to check CI logs regularly, this would not be an annoyance. However, as I said before, the reporting format could be improved.

> I want to stress that Lighthouse's reporting is a big factor in my head.

I have already shared some thoughts on Lighthouse reporting with @gabalafou separately. Still, while it does have some nice features like screenshotting, and its reports are visually nice, reporting accessibility as an absolute percentage score can be grossly misleading. What does a 90 vs. an 80 score look like in terms of issues found and their severity? Because not all accessibility violations are weighted the same.

> With history, you can check if performance or accessibility improved over time. More importantly, you can also easily discover if a change was made that decreased performance or accessibility.

Here is where I see the same data-granularity issue. Since the report is an absolute score (1-100%), how confident can we be that an 80% now vs. an 80% in three months does not imply regressions on already-fixed accessibility issues offset by improvements in other areas? I find the current reports challenging to navigate, perhaps because we are consuming them through GitHub artifacts (more details below).

I think this is where Playwright + axe-core is much better suited, along with the customization and extensibility (writing tests specific to the theme), which goes hand in hand with:

> Do we need to write custom accessibility tests, or do we limit the scope of our accessibility testing to running Axe-core scans on a set of pages?

I would undoubtedly prefer writing more accessibility tests to cover tab trapping, focus, and potentially other cases I am missing. The theme is also evolving, and new components or layouts might be added, so having those extensibility and customization options is good from that point of view. We are early in our accessibility journey with the theme. While I cannot say for sure what type of tests we will need to add in the future, this, combined with the very limited options for Lighthouse customization, could be a significant blocker or limiting factor for us in the long run.
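For concreteness, custom checks like the tab-trapping and focus tests mentioned above could be built from small Playwright primitives. This is only a sketch: the helper names are hypothetical, and `page` is assumed to be a Playwright `Page` object provided by a pytest-playwright fixture.

```python
# Sketch of custom keyboard-navigation checks (tab trapping / focus
# order). Helper names are hypothetical; `page` is a Playwright Page.


def record_tab_stops(page, presses: int) -> list:
    """Press Tab repeatedly and record the id of each focused element."""
    stops = []
    for _ in range(presses):
        page.keyboard.press("Tab")
        stops.append(page.evaluate("document.activeElement.id"))
    return stops


def focus_escapes(focus_log: list, allowed_ids: set) -> bool:
    """True if keyboard focus ever left the set of elements that should
    trap it (e.g. the contents of an open modal dialog)."""
    return any(el not in allowed_ids for el in focus_log)
```

In a test, one might open a modal, call `record_tab_stops(page, 10)`, and assert `not focus_escapes(stops, dialog_ids)`; this is the kind of per-component check that an axe-core page scan alone does not provide.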

> Lighthouse pros:
> • Report UI (to display test results in a user-friendly way)

Yes, the rendered HTML is nice, but not very user-friendly or intuitive. This may be down to our current configuration. Still, since the only way to interact with the reports is by downloading them from the GitHub Actions artifacts, we end up with many HTML files that, at first glance, are hard to tell apart (i.e., which one covers admonitions, images, etc.).

This is how the files look: [screenshot: contents of the lighthouse-results artifact]

Is this customizable? Or is it one of those things that would be too hard to do for little gain?

Also, we'd need to check whether we can run the accessibility tests on the dark theme, because they seem to run only on the light theme.

> Integration ecosystem (GitHub actions, for example)

This is something I am not too worried about; I spend a lot of time in the automation and CI/CD space (and I am biased as a consequence). But if the only pro here is that a GitHub Action handles the setup and dependencies, I would not class this as a pro for Lighthouse per se, but rather as evidence that someone in the community took the time and care to create an action.


Overall, we can keep both, at least for now; we said initially we would do that anyway.
But I am generally against relying on Lighthouse **only** (based on its limitations in data granularity and score transparency, extensibility, and customization).

And the concerns you've raised can be handled in one way or another.

@gabalafou (Collaborator)

Thanks so much for this detailed feedback on my questions.

This allows me to move forward confidently with Playwright.

One possibility this discussion sparked for me, which I hadn't considered before: perhaps I develop down both paths (Lighthouse AND Playwright) simultaneously for a while, until a clear winner emerges, and then we axe the other one. Is that something we think would be worthwhile? If not, I'm happy to just move forward with Playwright; all of the feedback and considerations taken together make it seem like a solid choice.

@gabalafou (Collaborator)

Could we resolve the "TBD" part of the PR description? Have we made a decision about whether those pages need to be added to the accessibility tests? If not, should we move that discussion into a separate PR? That way we can merge #1408 and allow it to close this issue, knowing exactly what it accomplished (one issue, one PR).

@drammock (Collaborator)

#1378 (comment) addresses the execution libraries, I think? Namely, that we shouldn't test them here. IDK about ablog... I guess if we're customizing the styling of ablog elements, then we should probably test to make sure our customizations aren't violating WCAG. @trallard does that make sense to you?

@trallard (Collaborator, Author)

Right now, ablog should inherit our styling. I do not think we are doing any custom modifications (I do not remember having to do any overwrites during the design system colour work), and a quick glance did not flag anything. So we should be safe skipping this too.

On the library side of things: let's focus on Playwright and keep Lighthouse as is, to avoid making changes to both and ending up with two kind-of-working solutions. Instead, let's focus on a single robust approach (with Lighthouse as a fallback on the side).

@drammock (Collaborator)

closed by #1408

@gabalafou (Collaborator)

So if I understand correctly, we're leaving Lighthouse in place just in case we decide we need to backtrack, or is there some other reason?

Should we create a task to eventually remove Lighthouse if all goes well down the Playwright route?

@drammock (Collaborator)

> So if I understand correctly, we're leaving Lighthouse in place just in case we decide we need to backtrack, or is there some other reason?

I would say "in case something goes wrong with playwright AND lighthouse gets better"

> Should we create a task to eventually remove Lighthouse if all goes well down the Playwright route?

Probably not? Lighthouse does some stuff that Playwright will never do (the performance testing), so I don't think it will ever be 100% redundant. And anyway, it shows up in our CI list on every PR, so it will be hard to forget that it's there --> it should be easy enough to revisit the "do we need Lighthouse" question at any point in the future.

@gabalafou (Collaborator) commented Aug 18, 2023

Right, of course. The performance testing. I guess I meant removing the accessibility portion of the Lighthouse tests, but there's not really a lot to do there, in terms of code. But I do think the docs are quite confusing in their current state, since they reference both Lighthouse and Playwright accessibility tests without explaining the redundancy.

Besides that, then, the only downside I see is that it will make CI slower than it could be otherwise, since we'll be running accessibility tests in both Lighthouse and Playwright.

@drammock (Collaborator)

> I do think the docs are quite confusing in their current state, since they reference both Lighthouse and Playwright accessibility tests without explaining the redundancy.

That is something we could improve, for sure

> Besides that, then, the only downside I see is that it will make CI slower than it could be otherwise, since we'll be running accessibility tests in both Lighthouse and Playwright.

I think they're concurrent, not serial? And if not, we could see about making them concurrent.
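For reference, GitHub Actions jobs within one workflow run concurrently by default; only a `needs:` dependency serializes them. A fragment like the following (job names and commands are illustrative, not this repo's actual workflow) keeps the two suites parallel:

```yaml
# Independent jobs run in parallel by default; only `needs:` serializes.
jobs:
  lighthouse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx @lhci/cli autorun
  playwright-a11y:          # no `needs:` -> runs alongside lighthouse
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest -k a11y
```

So as long as the two suites live in separate jobs (or separate workflows), running both should not lengthen wall-clock CI time beyond the slower of the two.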
