initialCanonicalUrl in scripts confuse search crawlers #53274
Comments
I also ran into this issue. I looked for a solution for weeks but found nothing. I also asked ChatGPT to recommend a solution, but none of its suggestions solved the problem. It's nice to find this issue; I hope a solution comes out of it. Thanks. |
Any update on this? Still seems to be an outstanding bug. |
Any updates / advice on workarounds? Thanks! |
Did anyone find a workaround for this? |
We have partially mitigated this by adding more patterns to robots.txt, but that does nothing to address the root cause. |
@adomaskizogian I see, thanks. I think I will try to manually replace the wrong text on our static pages the next time they are generated. Hopefully it will work. |
I have so many 404s in my Google Search Console, and I just realized it's because of this issue 😓 |
Google won't pick the
The Vercel about page is not affected, since Google picks up the canonical URL from the meta tag; that value is not related to SEO. |
I'm telling you for sure that I have a bunch (thousands) of URLs that Google “discovered” due to this variable. I agree that Google shouldn't pick it up as a URL, but it does. |
@huozhi I bet that if you go to Google Search Console and look at the section for pages that cause a redirect or a 404, you will see the page: |
That is not correct. Google does pick it up, and it wastes crawl budget. For people who have run into this problem and serve their pages through a CDN, one workaround is to filter the HTML text coming from the server with a regexp like this:
// responseHTML holds the HTML text received from the origin server.
if (responseHTML.includes("initialCanonicalUrl")) {
  responseHTML = responseHTML.replace(
    // p1: key plus path, p2: optional "?", p3: rest of the value, p4: closing quote
    /("initialCanonicalUrl\\?":\\?"[^?"]+)(\??)([^"\\]+)(\\?")/,
    (match, p1, p2, p3, p4) => {
      // Drop the query string when it carries the _rsc parameter.
      if (p3.includes("_rsc=")) {
        return p1 + p4;
      }
      return p1 + p2 + p3 + p4;
    }
  );
} |
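For anyone who wants to try the regexp from the comment above before wiring it into a CDN, here is a minimal sketch that wraps it in a function and runs it on two sample strings. The function name `stripRscParam` and both sample payloads are made up for illustration; the samples only imitate the escaped form shown in the issue's reproduction steps.

```javascript
// Hypothetical helper wrapping the regexp from the comment above.
function stripRscParam(html) {
  if (!html.includes("initialCanonicalUrl")) return html;
  return html.replace(
    /("initialCanonicalUrl\\?":\\?"[^?"]+)(\??)([^"\\]+)(\\?")/,
    // Remove the query string only when it contains _rsc=; otherwise keep the match.
    (match, p1, p2, p3, p4) => (p3.includes("_rsc=") ? p1 + p4 : match)
  );
}

// Made-up sample payloads (String.raw keeps the backslashes literal).
const withRsc = String.raw`{\"initialCanonicalUrl\":\"/blog/example-article?_rsc=1a2b\"}`;
const withoutRsc = String.raw`{\"initialCanonicalUrl\":\"/blog/example-article\"}`;

console.log(stripRscParam(withRsc));
console.log(stripRscParam(withoutRsc));
```

Two things to keep in mind: the regexp has no `g` flag, so only the first occurrence in the document is rewritten, and any query string that contains `_rsc=` is dropped as a whole, including other parameters next to it.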
This should be resolved with this refactor which will no longer serialize a prop called |
@gnoff I looked at the PR description and code, and I'm not sure it will resolve it.
so my source had
Anyway, I believe that as long as Next inserts this data inline as part of the page, it will still be picked up by Google. |
Reopening this until we can verify the above mentioned PR fixes it. |
Hmm @omerman, it seems this would then be a problem for that string appearing anywhere in the document. It seems quite aggressive for Google to make any kind of assumption about URL-matching string sequences, though I don't doubt you are seeing what you are seeing. Unfortunately, the benefits of encoding the initial flight data in the document are significant, so we don't have an easy way to omit it completely. We could use some alternate encoding, like an array of path sequences that get reified on the client, but that really only solves URL-like sequences for this one specific use case and not for any URL-apparent string. I will think about this some more and see if we can come up with something that makes sense. |
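To make the "array of path sequences" idea from the comment above concrete, here is a rough sketch. It is not how Next.js serializes anything; `encodePath` and `decodePath` are hypothetical names, and the path is the example from the issue description.

```javascript
// Hypothetical encoding: store the path as segments so that no
// slash-separated, URL-like string appears in the serialized data.
function encodePath(path) {
  return path.split("/").filter(Boolean);
}

// Client side: rebuild ("reify") the path from its segments.
function decodePath(segments) {
  return "/" + segments.join("/");
}

const serialized = JSON.stringify(encodePath("/en/blog/example-article"));
console.log(serialized); // ["en","blog","example-article"]
console.log(decodePath(JSON.parse(serialized))); // /en/blog/example-article
```

This sketch drops trailing slashes and ignores query strings, so a real format would need to carry those separately. As the comment says, it would also only cover this one value, not other URL-like strings in the page.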
Reviewing further, it seems Google is crawling anything URL-like it finds in the byte sequence of the document (meaning it doesn't have to parse the contents as a string per se). This is presumably to discover links that might be worth visiting, and it would certainly find false positives among the large variety of data that can be present on any given page. What I'd like to better understand is why this matters. What is the practical downside of Google attempting to visit pages that it thinks might exist? While we might have a workaround for this one specific URL-patterned string, I'm not particularly motivated to special-case this in Next.js when arbitrary data can trigger the exact same behavior from Google. But I admit I might be ignorant of some consequence that warrants handling this very specific special case. |
As @studentIvan mentioned, it's a waste of crawl budget. Maybe base64 encoding the URL could help? |
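A minimal sketch of what base64 encoding a single URL value could look like, assuming only that one string is encoded rather than the whole inlined payload. The helper names are hypothetical and not part of Next.js; `btoa` and `atob` are the standard globals available in browsers and in recent Node.js versions.

```javascript
// Hypothetical helpers; not part of Next.js.
function encodeCanonicalUrl(path) {
  // encodeURI keeps the input ASCII-only, which btoa requires.
  return btoa(encodeURI(path));
}

function decodeCanonicalUrl(encoded) {
  return decodeURI(atob(encoded));
}

const encoded = encodeCanonicalUrl("/en/blog/example-article");
console.log(encoded);
console.log(decodeCanonicalUrl(encoded)); // /en/blog/example-article
```

The trade-offs are that base64 output is about a third larger than its input and needs a decode step on the client, and that the base64 alphabet itself includes `/`, so this changes what the string looks like without guaranteeing that a crawler ignores it.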
@gnoff My preferred approach would be for Next to write a JS file with the preflight and reference it with a script tag in the main HTML document. (That way it's not inline, and to my knowledge Google doesn't analyze script tag references as well.) But that's something I'm saying without knowing the implications it has for performance. In any case, to be honest, I'm not sure myself whether Google will rank me worse because of these resulting 404s… it's not like they share it 😂 |
I understand that theoretically it's a wasted crawl. But my read of crawl budget is that 404s effectively neutralize this, and that the budget only impacts sites operating at a scale where you are likely to have sitemaps or other alternate crawling mechanisms in place to index the correct pages explicitly. Has anyone on this thread experienced a material adverse outcome because of this string? I'm genuinely asking because I want to better understand the concern. In my experience, SEO/indexing behavior is perceived as arcane knowledge conveyed through "best practices", because getting clear answers from Google can be hard. So there is a demand for certain things to work in accordance with perceived problems, even if they are not experienced problems. |
@omerman I wrote up a bit about why we don't do this here: #42170 (comment). We don't base64 encode the entire inlined flight stream because it increases its size, and then you need to decode it anyway, which is a bit slower than just parsing. Though when we support binary data, we will need to do this for the bits that contain binary data. I think we can look into changing the format of this URL. It's internal only, so it's not as if its being a string is part of any public API. It will help this specific case. |
@ztanner, will this be merged back to 14, since 15 is still in RC? One thing we noticed is that we see different behavior when building and running locally compared to running on Vercel, which doesn't make sense. |
Yes! |
This closed issue has been automatically locked because it had no new activity for 2 weeks. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you. |
Verify canary release
Provide environment information
Operating System:
  Platform: darwin
  Arch: x64
  Version: Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Binaries:
  Node: 18.13.0
  npm: 9.8.1
  Yarn: 1.22.10
  pnpm: N/A
Relevant packages:
  next: 13.4.7
  eslint-config-next: 13.4.1
  react: 18.2.0
  react-dom: 18.2.0
  typescript: 5.0.4
Which area(s) of Next.js are affected? (leave empty if unsure)
Metadata (metadata, generateMetadata, next/head)
To Reproduce
Open the Vercel website on the /about page, open the devtools 'Elements' tab, open the search input with cmd + F, and type initialCanonicalUrl. You will see this in one of the <script> tags:
... \"initialCanonicalUrl\":\"/app-future/en/about\", ...
Open /app-future/en/about and see that you get the same page!
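The manual check above can also be scripted. The sketch below pulls the value out of an HTML string with a regexp; `findInitialCanonicalUrl` is a hypothetical helper, and the sample HTML only reuses the fragment quoted in the steps above. It assumes the value appears in the escaped form shown there.

```javascript
// Hypothetical helper: return the serialized initialCanonicalUrl value, or null.
function findInitialCanonicalUrl(html) {
  // Tolerates both escaped (\") and plain (") quotes around key and value.
  const m = html.match(/\\?"initialCanonicalUrl\\?":\\?"([^"\\]*)/);
  return m ? m[1] : null;
}

// Sample built from the fragment quoted in the reproduction steps.
const sampleHtml = String.raw`<script>... \"initialCanonicalUrl\":\"/app-future/en/about\", ...</script>`;

console.log(findInitialCanonicalUrl(sampleHtml)); // /app-future/en/about
```

Comparing the returned value with the path you actually requested shows whether a page is affected.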
Describe the Bug
The actual page path and initialCanonicalUrl are different.
Our SEO department reported this problem to us; they said it confuses the search engine robots (don't ask me why) and reduces SEO optimization points.
For our case it looks like:
/blog/example-article – actual page path at English locale
/en/blog/example-article – initialCanonicalUrl from <script>
Expected Behavior
/blog/example-article – actual page path at English locale
/blog/example-article – initialCanonicalUrl from <script>
Which browser are you using? (if relevant)
Google Chrome
How are you deploying your application? (if relevant)
No response