Paginated pages should be absent from sitemap.xml #2527

Open
pranitbauva1997 opened this issue Jun 15, 2024 · 7 comments

Comments

@pranitbauva1997

Bug Report

I am using pagination to show blog posts and have enabled sitemap.xml. I use ahrefs for SEO site audits, and it flags an issue that arises because my sitemap contains links to paginated pages. It shouldn't, since the links to all blog posts and to the blog list page are already there.

Environment

Zola version: 0.18.0

Expected Behavior

The /blog/ URL should be a part of sitemap.xml along with all the blog articles but not /blog/page/1/.

Current Behavior

The paginated pages, i.e. /blog/page/1/, are a part of sitemap.xml, which makes search engines' lives difficult, and it could possibly lead to issues in indexing.

Steps to reproduce

Switch on pagination and sitemap generation.
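
To confirm the reproduction, the generated sitemap can be scanned for paginated entries. Below is a minimal Python sketch, not part of Zola. It assumes paginated URLs end in `/page/<number>/` (as in `/blog/page/1/`), and the `example.com` sample sitemap and the `paginated_locs` helper are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumption: paginated URLs end in /page/<number>/, e.g. /blog/page/1/.
PAGINATED = re.compile(r"/page/\d+/?$")


def paginated_locs(sitemap_xml: str) -> list[str]:
    """Return every <loc> in the sitemap that looks like a paginated page."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if el.text
    ]
    return [loc for loc in locs if PAGINATED.search(loc)]


# Hypothetical sample; a real check would read public/sitemap.xml instead.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/</loc></url>
  <url><loc>https://example.com/blog/page/1/</loc></url>
  <url><loc>https://example.com/blog/first-post/</loc></url>
</urlset>"""

print(paginated_locs(SAMPLE))
```

If the returned list is non-empty, the sitemap contains the paginated pages described above.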

@Keats
Collaborator

Keats commented Jun 15, 2024

Isn't the sitemap meant to have links to all pages of a site, including the paginated pages?

@pranitbauva1997
Author

@Keats I use ahrefs to understand how well my website is optimised for SEO. After I implemented pagination, it showed me this error and my SEO score dropped from 100 to 97: https://help.ahrefs.com/en/articles/2652498-non-canonical-page-in-sitemap-error-in-site-audit

The paginated pages are non-canonical, and the sitemap has the "blog list" page as well as the paginated pages. Currently I see that Google has already indexed the paginated page.

I feel pagination is only for viewability for end-users (so as not to overwhelm them with too many blog posts) and not for search engines. I am of the opinion that the search engine should index the "blog list" page (/blog/ for me) and all individual blog posts while leaving out the paginated pages, since they are not useful as search results.

I use site:bauva.com to see all the pages that are indexed by the search engine.

I am happy to contribute a PR once the community decides whether to go forward with this and on the approach. I am leaning towards a configuration variable in config.toml that specifies whether paginated pages should be part of sitemap.xml, defaulting to including them so that the current behaviour doesn't change. I'm interested in hearing about other approaches for introducing this feature.
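
As a rough sketch of the behaviour I have in mind (in Python purely for illustration; Zola itself is written in Rust, and the option name `include_paginated_pages`, the `SitemapOptions` type, and the `/page/<number>/` URL pattern are all assumptions, not existing Zola settings):

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumption: paginated URLs end in /page/<number>/, e.g. /blog/page/1/.
PAGINATED = re.compile(r"/page/\d+/?$")


@dataclass(frozen=True)
class SitemapOptions:
    # Hypothetical option; the default of True keeps the current behaviour.
    include_paginated_pages: bool = True


def sitemap_urls(
    all_urls: Iterable[str], options: Optional[SitemapOptions] = None
) -> list[str]:
    """Return the URLs that would be written to sitemap.xml."""
    options = options or SitemapOptions()
    if options.include_paginated_pages:
        return list(all_urls)
    # Drop paginated pages; the section page and the posts stay.
    return [url for url in all_urls if not PAGINATED.search(url)]


urls = ["/blog/", "/blog/page/1/", "/blog/page/2/", "/blog/first-post/"]
print(sitemap_urls(urls))
print(sitemap_urls(urls, SitemapOptions(include_paginated_pages=False)))
```

With the default, nothing changes for existing sites; only sites that opt out would lose the paginated entries.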

Apologies for the late reply. I will be more prompt in my replies going forward.

@Keats
Collaborator

Keats commented Jun 23, 2024

You can already use a different template for sitemap.xml, but in that case isn't the issue that the template should declare itself as self-canonical? Looking at https://seranking.com/blog/pagination/, we could also set some HTML headers for the paginated pages to ignore them.

@pranitbauva1997
Author

pranitbauva1997 commented Jun 24, 2024

@Keats The article you shared has detailed information regarding this. Learnt a few new things as well. Thanks.

Regarding using a custom sitemap.xml, I think excluding paginated pages is something most people would want, because most of them care about SEO but wouldn't want to customise sitemap.xml themselves. Also, I think we would still have to introduce some variables so the custom sitemap knows whether a particular page should be included or not. One website can have many paginated categories, like "blogs" or "annual reports" (in my case), and I would have to make the changes everywhere. I also can't think of a case where I would want it for "blogs" but not for "annual reports".

I think the HTML tags and robots.txt are also crucial, since all three (including sitemap.xml) affect search engine indexing. I have a _base.html template where I currently mark every page as to be indexed using the meta tag. I am not sure how to single out the paginated pages that have to be marked as noindex, though I am sure this also needs to be part of the PR.

The robots.txt change is trivial and there is no source code change required in this repo. Each user has to make changes at their end.
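
For example, a user could check their own rule with Python's standard `urllib.robotparser` before deploying. This is only a sketch: the `Disallow: /blog/page/` rule, the `example.com` URLs, and the `is_crawlable` helper are assumptions for a site whose paginated pages live under `/blog/page/`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site paginating under /blog/page/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/page/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str) -> bool:
    """True if a generic crawler ("*") may fetch the URL under ROBOTS_TXT."""
    return parser.can_fetch("*", url)


for url in (
    "https://example.com/blog/",
    "https://example.com/blog/first-post/",
    "https://example.com/blog/page/2/",
):
    print(url, is_crawlable(url))
```

The blog list page and the individual posts stay crawlable, while anything under `/blog/page/` is disallowed.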

For documenting this feature, I am thinking of introducing an "SEO" section on getzola.org while also including that variable on the configuration page. In the SEO section, we can explain to users why we have this feature and how to use it along with the HTML and robots.txt changes.

What are your thoughts?

@Keats
Collaborator

Keats commented Jun 24, 2024

That could be added to the config, but we would need to find a good name for it, or maybe make a new section in it.

@pranitbauva1997
Author

@Keats Let me start work on this and come up with a draft PR soon.

@JVimes

JVimes commented Oct 2, 2024

It might also be good to include a "noindex" meta tag on non-canonical pages:
https://developers.google.com/search/docs/crawling-indexing/block-indexing

3 participants