
Fix sitemap indexing #3708

Closed
astrojuanlu opened this issue Mar 13, 2024 · 23 comments · Fixed by #3885
Labels: Component: Documentation 📄 · Issue: Bug Report 🐞

Comments

@astrojuanlu (Member) commented Mar 13, 2024

Description

Even with a robots.txt in place, search engines still index pages that are listed as disallowed.

Task

"We need to upskill ourselves on how Google indexes the pages, RTD staff suggested we add a conditional <meta> tag for older versions but there's a chance this requires rebuilding versions that are really old, which might be completely impossible. At least I'd like engineering to get familiar with the docs building process, formulate what can reasonably be done, and state whether we need to make any changes going forward." @astrojuanlu

Context and example

https://www.google.com/search?q=kedro+parquet+dataset&sca_esv=febbb2d9e55257df&sxsrf=ACQVn0-RnsYyvwV7QoZA7qtz0NLUXLTsjw%3A1710343831093&ei=l8bxZfueBdSU2roPgdabgAk&ved=0ahUKEwi7xvujx_GEAxVUilYBHQHrBpAQ4dUDCBA&uact=5&oq=kedro+parquet+dataset&gs_lp=Egxnd3Mtd2l6LXNlcnAiFWtlZHJvIHBhcnF1ZXQgZGF0YXNldDILEAAYgAQYywEYsAMyCRAAGAgYHhiwAzIJEAAYCBgeGLADMgkQABgIGB4YsANI-BBQ6A9Y6A9wA3gAkAEAmAEAoAEAqgEAuAEDyAEA-AEBmAIDoAIDmAMAiAYBkAYEkgcBM6AHAA&sclient=gws-wiz-serp (thanks @noklam)

Result: https://docs.kedro.org/en/0.18.5/kedro.datasets.pandas.ParquetDataSet.html

[Screenshot: Google search result surfacing the 0.18.5 ParquetDataSet page]

However, that version is no longer allowed in our robots.txt:

User-agent: *
Disallow: /
Allow: /en/stable/
Allow: /en/0.19.3/
Allow: /en/0.19.2/
Allow: /en/0.19.1/
Allow: /en/0.19.0/
Allow: /en/0.18.14/
Allow: /en/0.17.7/

And in fact, according to https://technicalseo.com/tools/robots-txt/, the URL is reported as blocked:

[Screenshot: robots.txt testing tool showing the URL as blocked]
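The same check can be reproduced with Python's standard-library robots.txt parser; a small sketch, using the URL from the search result above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://docs.kedro.org/robots.txt")
rp.read()

url = "https://docs.kedro.org/en/0.18.5/kedro.datasets.pandas.ParquetDataSet.html"
# Expected: False -- /en/0.18.5/ is not in the Allow list, so the
# catch-all "Disallow: /" applies.
print(rp.can_fetch("Googlebot", url))
```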

astrojuanlu added the "Component: Documentation 📄" label on Mar 13, 2024
@astrojuanlu (Member Author)

Taking the liberty here of prioritizing this as High.

@tynandebold (Member)

Maybe a manual reindex is what's required here? Or a submission of a sitemap?

https://developers.google.com/search/docs/crawling-indexing/ask-google-to-recrawl

@noklam (Contributor) commented Mar 26, 2024

I do the sitemap reindex via Google Search Console all the time.

@astrojuanlu (Member Author)

We had a URL-prefix property in Google Search Console, so it covered only https://kedro.org and not everything under the kedro.org domain.

Requested a DNS change to LF AI & Data https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-26615

@astrojuanlu (Member Author)

"Indexed, though blocked by robots.txt"

[Screenshot: URL Inspection in Search Console showing "Indexed, though blocked by robots.txt"]

(┛ಠ_ಠ)┛彡┻━┻

@astrojuanlu (Member Author)

https://support.google.com/webmasters/answer/7440203#indexed_though_blocked_by_robots_txt

Indexed, though blocked by robots.txt

The page was indexed despite being blocked by your website's robots.txt file. Google always respects robots.txt, but this doesn't necessarily prevent indexing if someone else links to your page. Google won't request and crawl the page, but we can still index it, using the information from the page that links to your blocked page. Because of the robots.txt rule, any snippet shown in Google Search results for the page will probably be very limited.

Next steps: […]

@astrojuanlu (Member Author)

Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it.

https://developers.google.com/search/docs/crawling-indexing/block-indexing
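In other words, for Google to honor a noindex, the page must be crawlable so the directive can be seen. A quick sketch to check whether a built page actually carries a robots meta tag, reusing the 0.18.5 URL from above:

```python
import urllib.request

url = "https://docs.kedro.org/en/0.18.5/kedro.datasets.pandas.ParquetDataSet.html"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# Print any robots meta tags found in the page source.
print([line.strip() for line in html.splitlines()
       if "<meta" in line and "robots" in line])
```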

@astrojuanlu (Member Author)

Previous discussion about this on RTD readthedocs/readthedocs.org#10648

@astrojuanlu (Member Author)

We got some good advice in readthedocs/readthedocs.org#10648 (comment)

But this is blocked on #3586

@astrojuanlu (Member Author)

Potentially related: […]

merelcht changed the title from "Search engines still index pages disallowed in robots.txt" to "Investigate why search engines still index pages disallowed in robots.txt" on Apr 2, 2024
@noklam (Contributor) commented Apr 2, 2024

https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/

@noklam (Contributor) commented Apr 2, 2024

@astrojuanlu It would be very helpful to have access to the Google Search Console; can we catch up sometime this week? In addition, despite #3729, it appears the robots.txt hasn't been updated.

I'm not clear on the RTD build: do we need to manually refresh the robots.txt somewhere, or does it only get updated on release?
See: https://docs.kedro.org/robots.txt

@astrojuanlu (Member Author)

To customize this file, you can create a robots.txt file that is written to your documentation root on your default branch/version.

https://docs.readthedocs.io/en/stable/guides/technical-docs-seo-guide.html#use-a-robots-txt-file

The default version (currently stable) has to see a new release for this to happen.
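For Sphinx projects, the usual way to get such a file into the documentation root is html_extra_path; a sketch, assuming robots.txt sits next to conf.py:

```python
# conf.py -- files listed here are copied verbatim into the root of the
# HTML output at build time, so robots.txt is served from the site root.
html_extra_path = ["robots.txt"]
```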

@noklam (Contributor) commented Apr 10, 2024

We need to make sure the sitemap is crawled. See the example from Vizro:

User-agent: *

Disallow: /en/0.1.9/ # Hidden version
Disallow: /en/0.1.8/ # Hidden version
Disallow: /en/0.1.7/ # Hidden version
Disallow: /en/0.1.6/ # Hidden version
Disallow: /en/0.1.5/ # Hidden version
Disallow: /en/0.1.4/ # Hidden version
Disallow: /en/0.1.3/ # Hidden version
Disallow: /en/0.1.2/ # Hidden version
Disallow: /en/0.1.11/ # Hidden version
Disallow: /en/0.1.10/ # Hidden version
Disallow: /en/0.1.1/ # Hidden version

Sitemap: https://vizro.readthedocs.io/sitemap.xml

Ours is blocked currently.

[Screenshot: Google Search Console showing our sitemap as blocked]

This isn't the primary goal of this ticket, but we can look into it as well. The main goal of the ticket is "Why do URLs that we don't want indexed get indexed?", though we would definitely love to improve the opposite: "Why aren't URLs that we want indexed getting indexed?"

[Screenshot: Search Console indexing report]
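Once the sitemap is reachable, it can be sanity-checked locally before resubmitting it in Search Console; a sketch using only the standard library:

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP = "https://docs.kedro.org/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(urllib.request.urlopen(SITEMAP).read())
# List every URL the sitemap declares, to eyeball that only the
# versions we want indexed are present.
for loc in root.findall(".//sm:loc", NS):
    print(loc.text)
```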

@noklam (this comment was marked as outdated)

@astrojuanlu (Member Author)

Mind you, we don't want to index /en/latest/. The rationale is that we don't want users to land on docs that correspond to an unreleased version of the code.

noklam mentioned this issue Apr 11, 2024
@ankatiyar (Contributor)

Updated robots.txt in #3803. Will continue on this after the release (next sprint).

@astrojuanlu (Member Author)

Our sitemap still cannot be indexed.

@astrojuanlu (Member Author)

Renaming this issue, because there's nothing else to investigate: search engines (well, Google) will index pages blocked by robots.txt, because robots.txt is not the right mechanism to deindex pages.

astrojuanlu changed the title from "Investigate why search engines still index pages disallowed in robots.txt" to "Fix sitemap indexing" on May 21, 2024
@astrojuanlu (Member Author)

Addressed in #3885, keeping this open until we're certain the sitemap has been indexed.

@astrojuanlu (Member Author)

(robots.txt won't update until a new stable version is out)

noklam linked a pull request May 23, 2024 that will close this issue
@astrojuanlu (Member Author)

robots.txt got updated 👍

@astrojuanlu (Member Author)

[Screenshot]
