RFC: server-side routing, and SEO, for new classifier URLs #2446
Replies: 44 comments
-
OK, these are all useful things to consider, but let's step back and document why we're making some of these changes to begin with. We've already proposed the new routes in ADR 18: https://github.com/zooniverse/front-end-monorepo/blob/master/docs/arch/adr-18.md
The primary reason is that the current PFE behavior has been a constant source of confusion for volunteers, for project owners, and internally in our team (I've had to explain the priority order of selection many, many times). Not only that, but the underlying code is difficult to maintain. ADR 18 captures that we want to use these routes for workflows, subject sets, and subjects. What it doesn't capture, and instead assumes, is that the routes would still be 'automatically' resolved based on the same selection strategies. However, in our discussion in #1132, I think we're generally moving toward not automatically selecting any workflow for a volunteer in most cases, because that is still the source of the confusion. This means no more random selection and no automatic redirects from one workflow to another. Combined with the project goals of Engaging Crowds, we've built a UI to present the volunteer with a choice.
Considering these two goals, minimizing confusion and having more maintainable code, I think we should use a 404 for a resource that is not available and present the volunteer with the option to select a different workflow (or, later on, another project if no other workflows are available). As for the other HTTP response options, a 301 or 302 doesn't make sense to me because the page for that workflow resource hasn't moved when the workflow becomes inactive and unavailable.
I agree, though @beckyrother may have some other thoughts about this having to do with consistency of UX when you arrive to the
For security, I'd recommend that all IDs be validated against the serialized links array of active workflow IDs on the project. I made a comment about this previously with regard to the default workflow: #1961 (comment).
Second, we have to consider that certain authenticated user roles have permission to load workflows regardless of the workflow's active state: admins, project owners, project collaborators, testers, and maybe also experts(?). For these users, the route to a workflow should load regardless of its active state, and not 404, as long as the workflow actually exists. I think what this means is that we have two stages of validation:
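The two stages are not spelled out in the comment as captured here. One possible reading, as a minimal sketch: first check that the workflow belongs to the project at all, then check that it is either active or requested by a privileged role. The `ProjectLinks` shape, its field names, and the role strings are assumptions for illustration, not the Panoptes API.

```typescript
// Hypothetical shapes: real Panoptes resources carry many more fields.
type Role = "admin" | "owner" | "collaborator" | "tester" | "expert" | "volunteer";

interface ProjectLinks {
  workflows: string[];       // every workflow linked to the project (assumed field)
  activeWorkflows: string[]; // the active subset (assumed field)
}

// Roles allowed to load a workflow regardless of its active state.
// "expert" is included although the comment above is unsure about it.
const PRIVILEGED_ROLES: Role[] = ["admin", "owner", "collaborator", "tester", "expert"];

function canLoadWorkflow(links: ProjectLinks, workflowID: string, roles: Role[]): boolean {
  // Stage 1: the workflow must actually exist on this project.
  if (links.workflows.indexOf(workflowID) === -1) return false;
  // Stage 2: active workflows load for everyone...
  if (links.activeWorkflows.indexOf(workflowID) !== -1) return true;
  // ...inactive workflows load only for privileged roles.
  return roles.some((role) => PRIVILEGED_ROLES.indexOf(role) !== -1);
}
```

A caller would respond with a 404 whenever this returns false, which keeps the "does not exist" and "not permitted" cases indistinguishable to the requester.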
-
That's a great point. The PRs merged this week removed default workflows from the project models in the classifier and the NextJS app. For projects with multiple workflows, if no workflow is specified, then the classifier will not select one for you. I should add that to the description. It's a big change, but not one we'd notice on staging because it doesn't affect PH-TESS.
Validating workflow IDs against a project's workflow links seems sensible. Once #1964 is merged, I can add 404 rules for workflows too. If I add them at the moment, I think the changes to the Classify page will clash with that PR. Or I could expand that PR to include workflows.
I'd be interested in hearing from project owners as to whether we should build Classify pages for inactive workflows. I can see big positives and big negatives to doing that. I think I'm leaning towards the positives, but I'm mostly on the fence myself.
I forgot to say in the original description: the technical problem I've got at the moment is writing routing rules for the NextJS app. I'd like the outcome of this RFC to be a decision (in an ADR) as to what those routing rules should be.
NextJS has the concept of preview routes for pages that are in development/not yet public. I've been thinking those might be useful for routes that require authentication and workflows that are in development or testing mode.
-
If we validate only against active workflows, then we won't be able to publish pages for inactive workflows. Old workflows would vanish from the site, and any existing external links to the Classify page for those projects would break (if they had linked directly to the workflow by ID). That's the biggest negative consequence of these changes, to my mind. On projects like Galaxy Zoo, where there's a single active workflow that changes over time, we might be able to avoid link rot by making
-
@shaunanoordin mentioned in passing that there might be implications for the URLs used by Zooniverse classrooms. I haven't mentioned those here because I'm not 100% sure how classrooms work.
-
I think it would be helpful to enumerate what the positives and negatives are. I'm leaning toward not building these pages. I can conceive of several reasons why this would be a negative, and it seems like a possible security or privacy issue to me. They could be workflows from the development phase of the project that really should never be activated, or workflows that are intended to be private, for project experts only.
This looks like a promising feature. Could we note this as a possible solution for these specific scenarios?
The specific functionality depends on whether the educational program is Intro to Astro-like or Wildcam-like, but the heart of it is that these use cases really needed
-
An additional consequence: what do the default workflow and the UPP-stored workflow now mean if we're no longer going to load these automatically in a preferred order? Perhaps we can indicate which workflow is the default and which one you last worked on in the selector modal UI? The default could render as "Project suggested" and your UPP-stored one as "Last worked on"?
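As a rough sketch of the labelling idea above, a helper could annotate each workflow menu entry with the two suggested badges. The `WorkflowOption` shape and the function name are hypothetical; only the badge text comes from the comment.

```typescript
// Hypothetical menu item shape for the workflow selector.
interface WorkflowOption {
  id: string;
  displayName: string;
}

interface LabelledOption extends WorkflowOption {
  badges: string[];
}

// Attach "Project suggested" / "Last worked on" badges without selecting anything automatically.
function labelWorkflows(
  workflows: WorkflowOption[],
  defaultWorkflowID: string | null,
  lastWorkedOnID: string | null
): LabelledOption[] {
  return workflows.map((workflow) => {
    const badges: string[] = [];
    if (workflow.id === defaultWorkflowID) badges.push("Project suggested");
    if (workflow.id === lastWorkedOnID) badges.push("Last worked on");
    return { id: workflow.id, displayName: workflow.displayName, badges };
  });
}
```

The volunteer still makes the choice; the badges only surface what the project configuration and UPP already record.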
-
You're absolutely right. I've extended #1964 to add 404 pages for inactive workflows.
I was thinking of the case where a long-running project uses different workflows over the lifetime of the project. Workflow URLs allow us to preserve the workflow history, similar to how Galaxy Zoo 1, Galaxy Zoo 2 etc. are preserved, for research purposes. This would then allow papers to cite the exact workflow that was used to obtain results.
When a project like Galaxy Zoo takes down a workflow and replaces it with another, we do want to be careful about harming our search ranking. A 404 will break all incoming links to the old workflow, which would lower the ranking of the Classify page. SEO best practice would be to redirect the old workflow ID to the new workflow ID, if possible. We can mitigate this by redirecting
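If the redirect option discussed here were adopted, the lookup might look like the sketch below. The `replacements` map (old workflow ID to new workflow ID) is an assumption: nothing in this thread says where that mapping would be stored. The returned object follows the `redirect` shape that Next.js documents for `getServerSideProps`; it uses `statusCode: 301` because, as far as I know, `permanent: true` produces a 308 rather than a 301.

```typescript
// Mirrors the `redirect` return value of Next.js getServerSideProps.
interface RedirectResult {
  redirect: { destination: string; statusCode: 301 };
}

// `replacements` is a hypothetical map of retired workflow ID -> replacement workflow ID.
function redirectForRetiredWorkflow(
  projectPath: string,
  workflowID: string,
  replacements: Record<string, string>
): RedirectResult | null {
  const hasReplacement = Object.prototype.hasOwnProperty.call(replacements, workflowID);
  if (!hasReplacement) return null; // no redirect: fall through to normal handling
  return {
    redirect: {
      destination: projectPath + "/classify/workflow/" + replacements[workflowID],
      statusCode: 301,
    },
  };
}
```

Returning `null` for unmapped IDs leaves the 200 or 404 decision to whatever handles the request next.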
-
@chrislintott asked recently about projects that show a random workflow to each volunteer. It's possible to route
-
Wouldn't marking a workflow as 'Project suggested' bias classifiers towards choosing that workflow? I'm not sure about the consequences of that. UPP definitely needs to be considered, but that's handled client-side and I'm trying to focus the discussion here towards how I should set up the server-side config for the NextJS app. I think the client-side auth, in the Next app, will have to be updated to load in UPP alongside the user. Then the UPP can be used to update the workflow menu in the browser. At the moment, I have the workflow menu set to wait until the user has loaded, in order to prepare for this. That code could probably be updated to use Suspense in the browser.
-
We discussed this in #1132 and hardly any projects use random workflow selection. Just as a reminder: #1132 (comment). We only found three projects, and two were likely workarounds for special circumstances. For the one possibly legitimate use case that exists, we have no idea why it was wanted, as it was never documented. Because there is a single project out of hundreds, I think we should move forward with deprecation.
Default workflow selection already biases volunteers to work on it, just without their knowledge.
This should be noted in whatever ADR comes out of this, as a consequence at the very least. I would recommend that #1132 and this discussion merge into the same ADR. I don't think these can be separate discussions and decisions, because they impact each other.
-
I think security and privacy concerns trump SEO concerns.
-
I'd be in favour of removing the complexity here and preserving all the existing workflow URLs, so a 200 response rather than a redirect response. I was thinking that any workflows that are finished and/or deactivated could have a different UI component that overlays the underlying classifier, perhaps a dismissible UI component that obscures or disables the interface until it's dismissed or interacted with. This finished/deactivated UI could also direct the user back to the workflow selection area to access the workflows that need more contributions. I'll defer on the UX to those better able than myself. The above would preserve the ability to still use old workflows for posterity, but ensure we signal clearly that this workflow is finished and doesn't need any more contributions. This is something the PFE system has always had trouble with, and it confuses volunteers.
Agree that automatic selection was often confusing and led to classifications being submitted where they weren't useful. I agree that we should not do any automatic workflow selection. In my opinion, we should allow the user to make good judgements via UI signalling, providing the information to the user if a workflow is finished/deactivated.
What about when a user wants to access a historical workflow for a paper/demonstration/outreach session? A hard 404 here would break this use case. I think a 404 would be reasonable if the old workflow was actually deleted from the API. I vote strongly in favour of keeping the old workflows around and accessible on their existing URLs.
-
I specifically mean using a 404 for a workflow that no longer exists as a resource on Panoptes, or for workflows that the user does not have permission to load. Finished workflows would 200 and would use a UI prompt to encourage volunteers to work on something else. See #642
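The rule in this comment can be written down as a small decision function. This is a sketch only: the `WorkflowState` flags are assumed to have been worked out already from the API and the user's roles, and the field names are hypothetical.

```typescript
// Assumed inputs, derived elsewhere from Panoptes and the logged-in user.
interface WorkflowState {
  exists: boolean;    // still a resource on Panoptes
  permitted: boolean; // the current user may load it
  finished: boolean;  // the workflow needs no more classifications
}

interface PageResponse {
  statusCode: 200 | 404;
  promptForAnotherWorkflow: boolean;
}

function responseForWorkflow(state: WorkflowState): PageResponse {
  if (!state.exists || !state.permitted) {
    // Missing or not permitted: 404, and offer the workflows that are available.
    return { statusCode: 404, promptForAnotherWorkflow: true };
  }
  // Finished workflows still load (200) but nudge the volunteer elsewhere.
  return { statusCode: 200, promptForAnotherWorkflow: state.finished };
}
```

No branch redirects, which matches the position that a workflow page has not moved just because the workflow is unavailable.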
-
Finished workflows on Galaxy Zoo (off the top of my head) are inactive, though. Right now, projects don't seem to distinguish between stuff that's experimental, in development etc. and workflows that used to be live but have been turned off because they've finished gathering classifications. I do like the idea of workflows having permanent URLs for posterity, but I'd defer to the wishes of project owners and builders here.
-
We don't allow most users to load inactive workflows, with the exception of admins, owners, collaborators, testers, and experts, I believe. This is therefore functionally a permissions feature, and we are currently redirecting users if they do not have permission. In the new classifier, I'm proposing that we inform users who do not have permission that the workflow is not available, which is what a 404 is, and ask them what they want to do rather than redirect them. Many projects inactivate workflows not because they want to control who is able to load the workflow, but because we don't have a good UX solution for redirecting effort, and people consistently do not read or understand the tiny 'finished' banners. The new prompt asking users what they want to do will cover this case, so I predict we'll see fewer workflows set to inactive and more that load and prompt the user about what they want to do instead. Then inactive workflows will functionally behave like what I think they're really intended to be: a way to control who has permission to load and view the workflow.
-
Bumping this because Davy Notebooks is now at a stage where they are shutting down completed workflows, causing those pages to 404 (or worse, error when GoogleBot fetches them. See #2412.) Here's an example of a broken URL for a completed notebook: https://www.zooniverse.org/projects/humphrydavy/davy-notebooks-project/classify/workflow/18244
A 404 page for workflows would work here, where we explain that the workflow has been finished/turned off for secret reasons/eaten by a gru and then point the volunteer to active workflows that need work. Pinging @snblickhan because we've been talking about this recently for Engaging Crowds projects.
-
There's been a 404 page design for some time. I would specifically recommend a dedicated 404-type page for workflows, where the workflow menu is shown so users can select from what is available (if none, then we would show the similar-project recommendations, yet to be implemented). I've recommended a 404 plus a prompt asking them what they want to do in several past comments, and it continues to be my recommendation.
-
Has anyone got thoughts about which project URLs should be indexed by Google and which should be marked as
At the moment, we’re getting emails from Google because there are high rates of 500 errors on URLs for workflows which had been indexed but are now complete.
-
Forgot to add: I'm also wondering if subjects should be indexed by search engines. GoogleBot will go through every subject link in the subject picker, and index it, unless we tell it not to. That's a lot of URLs for a reasonably sized project.
-
@srallen project-level and workflow-level 404 pages make sense to me too. NextJS only supports one, static 404 page per app, at
I haven't really thought about how we handle 'not found' errors at the subject set or subject level. I'm open to suggestions or ideas for that.
-
@eatyourgreens subject sets or subjects that are not found could perhaps function similarly to what I've done in #2418. The classifier is paused from loading and the selector modal is opened, prompting the volunteer to select from what is available.
-
@srallen Thanks! I'll take a look. We now have a finished project, from the Scarlets & Blues alpha, and URLs are kind of broken because the workflow is undefined, so I'm open to ideas as to how those might work. Re-using existing behaviour from Gravity Spy, rather than defining unique behaviour just for Engaging Crowds projects, would definitely get my vote too.
-
I wonder if we could use
-
The only point I really feel qualified to give here is that I think we shouldn't assume that project builders will leave workflows active (DNP is a great example). If this practice causes errors as in #2412, maybe that's a reason not to do it any longer. Otherwise, we need to communicate to project builders that leaving workflows set to 'Active' is best practice.
Would the benefit of indexing specific workflow URLs be for web archiving, for example? That's the only reason I could think of (but maybe these two things are unrelated -- I'm not an expert here). As a user, I know I find it super annoying when I'm using Google to search for a project and I get a link that isn't just the project homepage.
-
Deactivated workflows give you a Page Not Found error. Try this link, which fixes the bug.
-
That’s useful to know about Google search. We could block search engines from indexing all the Classify pages for a project. It seems, to me, like the Research pages should be searchable though.
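One way to express "block Classify pages, keep everything else searchable" is a function that picks the value for a robots meta tag (or an `X-Robots-Tag` header) from the request path. This is a sketch under the assumption that every Classify URL contains a `/classify` path segment, as in the routes proposed in this RFC; the function name is made up.

```typescript
// Returns the content for <meta name="robots"> or an X-Robots-Tag header.
function robotsDirective(pathname: string): string {
  // Matches ".../classify" and ".../classify/...", but not ".../classifying".
  const isClassifyPage = /\/classify(\/|$)/.test(pathname);
  return isClassifyPage ? "noindex" : "index, follow";
}
```

Note that `noindex` only keeps a page out of the index: a crawler still has to fetch the page to see the directive. Stopping the fetches themselves, for example for every subject link, would need a robots.txt `Disallow` rule instead.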
-
@snblickhan you’re right about indexing and archiving being related. Internet Archive honours the same protocols as GoogleBot. So it won’t harvest a URL that’s been blocked.
-
Becky provided a design for a prompt for when workflows are unavailable: #2445
-
I'll be converting this to a discussion so there's one place to look for RFC discussions. Discrete tasks can be made into issues when they're identified and the decisions documented into the ADRs.
-
Updated 404 page designs can be found here: Invision. And the resources can be found here: Design-Resources.
-
Package
app-project
Description
As part of Engaging Crowds, we're retiring the /classify catch-all URL for a project, which loads a classifier page for your currently chosen workflow. Every workflow has the same page URL.
We're replacing that with individual URLs for workflows, subject sets and subjects (the latter two only for projects that allow volunteers to select a subject set and subject to classify). Each of the following will have its own Classify page, built and served by NextJS:
/classify/workflow/:workflowID
/classify/workflow/:workflowID/subject-set/:subjectSetID
/classify/workflow/:workflowID/subject-set/:subjectSetID/subject/:subjectID
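To make the three route shapes concrete, here is a small sketch that parses a request path into its IDs. In the NextJS app this work would normally be done by dynamic route segments; the regular expression and the `parseClassifyPath` name are illustrative only.

```typescript
interface ClassifyParams {
  workflowID: string;
  subjectSetID?: string;
  subjectID?: string;
}

// workflow is required; subject-set is optional; subject is only valid inside a subject-set.
const CLASSIFY_ROUTE =
  /\/classify\/workflow\/([^/]+)(?:\/subject-set\/([^/]+)(?:\/subject\/([^/]+))?)?\/?$/;

function parseClassifyPath(pathname: string): ClassifyParams | null {
  const match = CLASSIFY_ROUTE.exec(pathname);
  if (!match) return null; // not one of the three Classify URL shapes
  const params: ClassifyParams = { workflowID: match[1] };
  if (match[2]) params.subjectSetID = match[2];
  if (match[3]) params.subjectID = match[3];
  return params;
}
```

The bare /classify URL deliberately returns `null` here, since it carries no workflow ID to parse.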
At the moment, pages are built at request time, with each page being built afresh on every incoming request for that page.
Consequences
SEO
Each of these URLs can be crawled, and indexed, by search engines. When a project home page is indexed, the Indexing Tool (if present) will be crawled and requests made for every subject set and subject listed. NextJS will then try to build pages for every subject set and subject. Do we want this to happen? What are the consequences of allowing/blocking search-engine indexing of the Classify pages?
Over time, pages will become stale: workflows will be deactivated, subject sets and subjects completed. Do we want page URLs to persist in search engine indexes when workflows, subject sets or subjects have been fully classified? Here, there might be an argument for persisting URLs, for workflows at least, so that they can be cited in papers.
Server responses
Related to the above, I've started setting up responses for workflows (but not for subject sets or subjects yet.) When a workflow's finished, and maybe replaced with a new workflow (new URL), I'm not completely sure what the appropriate response should be:
301 redirects are best practice SEO when you move a page, so I've set up a redirect at the workflow level in #1965. For example, Galaxy Zoo retires an old workflow and starts a new one. Links to the old workflow Classify page would now automatically point to the new one. I'm not sure if that's the correct approach. Maybe the old page should live on, at the old address? Note that the new page might start off with a lower search ranking, since all the incoming links on the web will point to the old workflow URL.
I think we are all agreed that /classify should 301 redirect to /classify/workflow/:workflowID for projects like PH-TESS and Galaxy Zoo, where there's only ever one active workflow.
Canonical URLs and page titles
Since we're minting individual page URLs for workflows, subject sets and subjects, each of those URLs should have a unique page title for SEO and bookmarking. We should probably publish canonical URLs, at the very least for projects with a single workflow, where /classify points to /classify/workflow/:activeWorkflowID.
At the moment, the project app is hardcoded to use a single, constant title for all project pages.
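A sketch of how canonical paths and unique titles could be derived from the same page context. The `PageContext` shape and the " | " title format are assumptions for illustration; for a single-workflow project, the caller would pass the active workflow's ID even when the request was for /classify, so that both URLs report the same canonical path.

```typescript
// Hypothetical page context assembled while building a Classify page.
interface PageContext {
  projectPath: string; // e.g. "/projects/owner/project-slug"
  projectName: string;
  workflowID?: string;
  workflowName?: string;
  subjectSetID?: string;
  subjectSetName?: string;
  subjectID?: string;
}

// Builds the most specific Classify path that the available IDs support.
function canonicalPath(page: PageContext): string {
  let path = page.projectPath + "/classify";
  if (!page.workflowID) return path;
  path += "/workflow/" + page.workflowID;
  if (!page.subjectSetID) return path;
  path += "/subject-set/" + page.subjectSetID;
  if (!page.subjectID) return path;
  return path + "/subject/" + page.subjectID;
}

// Most specific part first, project name last.
function pageTitle(page: PageContext): string {
  const parts: string[] = [];
  if (page.subjectID) parts.push("Subject " + page.subjectID);
  if (page.subjectSetName) parts.push(page.subjectSetName);
  if (page.workflowName) parts.push(page.workflowName);
  parts.push(page.projectName);
  return parts.join(" | ");
}
```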
Performance
At the moment, we run a page build on every incoming request for a URL. We could improve performance, and lower costs, by leveraging static optimisation and serving our HTML via a CDN cache.
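If pages stay server-rendered, one low-effort way to involve a CDN is to set a `Cache-Control` header per response. The sketch below picks a header value from the status code; the lifetimes are placeholder assumptions, not measured or agreed values.

```typescript
// Chooses a Cache-Control value for a rendered Classify page.
// s-maxage applies to shared caches (the CDN), not to the volunteer's browser.
function cacheControlFor(statusCode: number): string {
  if (statusCode === 200) {
    // Serve from the CDN for 5 minutes, then allow a stale copy while refetching.
    return "public, s-maxage=300, stale-while-revalidate=60";
  }
  if (statusCode === 404) {
    // Cache "not found" briefly so repeated bad requests don't each trigger a build.
    return "public, s-maxage=60";
  }
  // Errors and anything unexpected should never be cached.
  return "no-store";
}
```

This only suits HTML that is identical for every visitor; anything rendered differently per user on the server would need to be excluded.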
Security
We should avoid building a page for just any incoming request URL. If a malicious actor writes a script that generates a large number of fictitious workflow IDs, requesting /classify/workflow/:workflowID for each in turn, we don't want NextJS to start building those pages until it falls over. Is there a way to quickly validate the incoming request, via the API, and respond with 404 for obviously wrong or mistyped IDs?
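Before any API call, a purely syntactic check can reject the obviously wrong IDs for free. This sketch assumes workflow, subject set and subject IDs are positive integers without leading zeros (the example URLs in this thread use numeric IDs) and that 12 digits is a safe upper bound; both are assumptions to confirm against Panoptes.

```typescript
const MAX_ID_LENGTH = 12; // assumed upper bound on ID length

// Cheap pre-check: true only for strings that look like a positive integer ID.
function isPlausibleID(id: string): boolean {
  return id.length <= MAX_ID_LENGTH && /^[1-9][0-9]*$/.test(id);
}
```

An ID that fails this check can get a 404 immediately. An ID that passes still has to be checked against the project, since a well-formed number says nothing about whether the workflow exists.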