
Fix sitemap indexing #3708

Closed
astrojuanlu opened this issue Mar 13, 2024 · 23 comments · Fixed by #3885
Labels: Component: Documentation 📄 · Issue: Bug Report 🐞

Comments

@astrojuanlu (Member) commented Mar 13, 2024

Description

Even with a robots.txt in place, search engines still index pages that are listed as disallowed.

Task

"We need to upskill ourselves on how Google indexes the pages, RTD staff suggested we add a conditional <meta> tag for older versions but there's a chance this requires rebuilding versions that are really old, which might be completely impossible. At least I'd like engineering to get familiar with the docs building process, formulate what can reasonably be done, and state whether we need to make any changes going forward." @astrojuanlu

Context and example

https://www.google.com/search?q=kedro+parquet+dataset&sca_esv=febbb2d9e55257df&sxsrf=ACQVn0-RnsYyvwV7QoZA7qtz0NLUXLTsjw%3A1710343831093&ei=l8bxZfueBdSU2roPgdabgAk&ved=0ahUKEwi7xvujx_GEAxVUilYBHQHrBpAQ4dUDCBA&uact=5&oq=kedro+parquet+dataset&gs_lp=Egxnd3Mtd2l6LXNlcnAiFWtlZHJvIHBhcnF1ZXQgZGF0YXNldDILEAAYgAQYywEYsAMyCRAAGAgYHhiwAzIJEAAYCBgeGLADMgkQABgIGB4YsANI-BBQ6A9Y6A9wA3gAkAEAmAEAoAEAqgEAuAEDyAEA-AEBmAIDoAIDmAMAiAYBkAYEkgcBM6AHAA&sclient=gws-wiz-serp (thanks @noklam)

Result: https://docs.kedro.org/en/0.18.5/kedro.datasets.pandas.ParquetDataSet.html

[Screenshot: Google search result surfacing the 0.18.5 ParquetDataSet page]

However, that version is no longer allowed in our robots.txt:

User-agent: *
Disallow: /
Allow: /en/stable/
Allow: /en/0.19.3/
Allow: /en/0.19.2/
Allow: /en/0.19.1/
Allow: /en/0.19.0/
Allow: /en/0.18.14/
Allow: /en/0.17.7/

And in fact, according to https://technicalseo.com/tools/robots-txt/, the URL is reported as blocked:

[Screenshot: robots.txt testing tool showing the URL as blocked]
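The same check can be reproduced with Python's standard-library robots.txt parser; a small sketch, using the URL from the search result above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://docs.kedro.org/robots.txt")
rp.read()

url = "https://docs.kedro.org/en/0.18.5/kedro.datasets.pandas.ParquetDataSet.html"
# Expected: False -- /en/0.18.5/ is not in the Allow list, so the
# catch-all "Disallow: /" applies.
print(rp.can_fetch("Googlebot", url))
```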

astrojuanlu added the "Component: Documentation 📄" label on Mar 13, 2024
@astrojuanlu (Member Author)

Taking the liberty here of prioritizing this as High.

@tynandebold (Member)

Maybe a manual reindex is what's required here? Or a submission of a sitemap?

https://developers.google.com/search/docs/crawling-indexing/ask-google-to-recrawl

@noklam (Contributor) commented Mar 26, 2024

I do the sitemap reindex via Google Search Console all the time.

@astrojuanlu (Member Author)

We had a URL-prefix property in Google Search Console, so it covered only https://kedro.org and not everything under the kedro.org domain.

Requested a DNS change to LF AI & Data https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-26615

@astrojuanlu (Member Author)

"Indexed, though blocked by robots.txt"

[Screenshot: URL Inspection in Search Console showing "Indexed, though blocked by robots.txt"]

(┛ಠ_ಠ)┛彡┻━┻

@astrojuanlu (Member Author)

https://support.google.com/webmasters/answer/7440203#indexed_though_blocked_by_robots_txt

Indexed, though blocked by robots.txt

The page was indexed despite being blocked by your website's robots.txt file. Google always respects robots.txt, but this doesn't necessarily prevent indexing if someone else links to your page. Google won't request and crawl the page, but we can still index it, using the information from the page that links to your blocked page. Because of the robots.txt rule, any snippet shown in Google Search results for the page will probably be very limited.

Next steps: […]

@astrojuanlu (Member Author)

Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it.

https://developers.google.com/search/docs/crawling-indexing/block-indexing
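In other words, for Google to honor a noindex, the page must be crawlable so the directive can be seen. A quick sketch to check whether a built page actually carries a robots meta tag, reusing the 0.18.5 URL from above:

```python
import urllib.request

url = "https://docs.kedro.org/en/0.18.5/kedro.datasets.pandas.ParquetDataSet.html"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# Print any robots meta tags found in the page source.
print([line.strip() for line in html.splitlines()
       if "<meta" in line and "robots" in line])
```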

@astrojuanlu (Member Author)

Previous discussion about this on RTD readthedocs/readthedocs.org#10648

@astrojuanlu (Member Author)

We got some good advice in readthedocs/readthedocs.org#10648 (comment)

But this is blocked on #3586

@astrojuanlu (Member Author)

Potentially related: […]

merelcht changed the title from "Search engines still index pages disallowed in robots.txt" to "Investigate why search engines still index pages disallowed in robots.txt" on Apr 2, 2024
@noklam (Contributor) commented Apr 2, 2024

https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/

@noklam (Contributor) commented Apr 2, 2024

@astrojuanlu It would be very helpful to have access to the Google Search Console; can we catch up sometime this week? In addition, despite #3729, it appears the robots.txt hasn't been updated.

I'm not clear on the RTD build: do we need to manually refresh the robots.txt somewhere, or does it only get updated on release?
See: https://docs.kedro.org/robots.txt

@astrojuanlu (Member Author)

To customize this file, you can create a robots.txt file that is written to your documentation root on your default branch/version.

https://docs.readthedocs.io/en/stable/guides/technical-docs-seo-guide.html#use-a-robots-txt-file

The default version (currently stable) has to see a new release for this to happen.
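For Sphinx projects, the usual way to get such a file into the documentation root is html_extra_path; a sketch, assuming robots.txt sits next to conf.py:

```python
# conf.py -- files listed here are copied verbatim into the root of the
# HTML output at build time, so robots.txt is served from the site root.
html_extra_path = ["robots.txt"]
```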

@noklam (Contributor) commented Apr 10, 2024

We need to make sure the sitemap is crawled. See the example from Vizro:

User-agent: *

Disallow: /en/0.1.9/ # Hidden version
Disallow: /en/0.1.8/ # Hidden version
Disallow: /en/0.1.7/ # Hidden version
Disallow: /en/0.1.6/ # Hidden version
Disallow: /en/0.1.5/ # Hidden version
Disallow: /en/0.1.4/ # Hidden version
Disallow: /en/0.1.3/ # Hidden version
Disallow: /en/0.1.2/ # Hidden version
Disallow: /en/0.1.11/ # Hidden version
Disallow: /en/0.1.10/ # Hidden version
Disallow: /en/0.1.1/ # Hidden version

Sitemap: https://vizro.readthedocs.io/sitemap.xml

Ours is blocked currently.

[Screenshot: Google Search Console showing our sitemap as blocked]

This isn't the primary goal of this ticket, but we can look into it as well. The main goal of the ticket is "Why do URLs that we don't want indexed get indexed?", though we would definitely love to improve the opposite: "Why aren't URLs that we want indexed getting indexed?"

[Screenshot: Search Console indexing report]
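Once the sitemap is reachable, it can be sanity-checked locally before resubmitting it in Search Console; a sketch using only the standard library:

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP = "https://docs.kedro.org/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(urllib.request.urlopen(SITEMAP).read())
# List every URL the sitemap declares, to eyeball that only the
# versions we want indexed are present.
for loc in root.findall(".//sm:loc", NS):
    print(loc.text)
```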

@noklam (this comment was marked as outdated)

@astrojuanlu (Member Author)

Mind you, we don't want to index /en/latest/. The rationale is that we don't want users to land on docs that correspond to an unreleased version of the code.

noklam mentioned this issue Apr 11, 2024
@ankatiyar (Contributor)

Updated robots.txt in #3803. Will continue on this after the release (next sprint).

@astrojuanlu (Member Author)

Our sitemap still cannot be indexed.

@astrojuanlu (Member Author)

Renaming this issue, because there's nothing else to investigate: search engines (well, Google) will index pages blocked by robots.txt, because robots.txt is not the right mechanism to deindex pages.

astrojuanlu changed the title from "Investigate why search engines still index pages disallowed in robots.txt" to "Fix sitemap indexing" on May 21, 2024
@astrojuanlu (Member Author)

Addressed in #3885, keeping this open until we're certain the sitemap has been indexed.

@astrojuanlu (Member Author)

(robots.txt won't update until a new stable version is out)

noklam linked a pull request May 23, 2024 that will close this issue
@astrojuanlu (Member Author)

robots.txt got updated 👍

@astrojuanlu (Member Author)

[Screenshot]
