Anchors are being stripped out (using sitemaps, linkExtractor and externalData) #1831

Open
bojanrajh opened this issue Mar 21, 2023 · 4 comments
Labels
crawler issue related to the indexing

Comments

@bojanrajh
Contributor

bojanrajh commented Mar 21, 2023

Description

We use the Algolia Crawler UI to index our website, a mix of static HTML and an SPA that uses a hash router. All URLs are provided through the sitemaps option of the Crawler config.

new Crawler({
  startUrls: [],
  sitemaps: ["https://example.com/sitemap.xml"],
  // ...
})

Steps to reproduce

Use a sitemap with the following content:

<!-- ... -->
<url>
  <loc>https://example.com/page.html</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/foo</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/bar</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<!-- ... -->

... or using the static linkExtractor:

new Crawler({
  // ...
  linkExtractor: () => {
    return [
      "https://example.com/page.html",
      "https://example.com/subpage.html#/foo",
      "https://example.com/subpage.html#/bar",
    ];
  },
  // ...
})

Then run the URL Tester.

Result:

LINKS
Found 2 links matching your configuration 
 - https://example.com/page.html
 - https://example.com/subpage.html

Expected behavior

Expected result:

LINKS
Found 3 links matching your configuration 
 - https://example.com/page.html
 - https://example.com/subpage.html#/foo
 - https://example.com/subpage.html#/bar

Note that these are not section anchors. They are actual pages, and the URL Tester parses them correctly with the renderJavaScript: true option when given the full URL including the anchor.
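To illustrate why three links can end up as two, here is a minimal sketch that assumes the crawler normalizes each URL by dropping its fragment before deduplicating. The issue does not confirm that this is what the crawler does internally, and stripFragment and dedupe are hypothetical helpers, not part of the Algolia Crawler API. The sketch only uses the standard URL class available in Node.js and browsers.

```javascript
// Sketch only: mimics a crawler that drops the fragment before deduplicating.
// stripFragment and dedupe are hypothetical helpers, not Algolia Crawler APIs.
const links = [
  "https://example.com/page.html",
  "https://example.com/subpage.html#/foo",
  "https://example.com/subpage.html#/bar",
];

// Remove the "#..." part of a URL using the standard URL class.
function stripFragment(href) {
  const url = new URL(href);
  url.hash = "";
  return url.href;
}

// Keep the first occurrence of each distinct string.
function dedupe(hrefs) {
  return [...new Set(hrefs)];
}

const withoutFragments = dedupe(links.map(stripFragment));
const withFragments = dedupe(links);

console.log(withoutFragments.length); // 2
console.log(withFragments.length); // 3
```

With the fragment removed, both hash routes map to https://example.com/subpage.html, which matches the two links reported by the URL Tester.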

Environment

  • Algolia Crawler UI

Similar issues:

@shortcuts
Member

Hey, thanks for opening the issue. #1823 seems related.

I'll investigate whether there's a way for us to differentiate hash-routed pages from anchored sections.

@bojanrajh
Contributor Author

bojanrajh commented Mar 21, 2023

Thank you for the quick response!
Just for more clarity: we don't mind implementing a custom linkExtractor or a recordExtractor that sets a custom objectID. We just need those URLs to be accepted (crawling works as intended when we run the crawl manually from the UI).
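As a rough sketch of the custom objectID idea, the helper below derives a distinct objectID per hash route from the full URL. toRecord is a hypothetical function, not an Algolia API, and the sketch assumes the full URL, fragment included, is available at the point where records are built, which is exactly what this issue puts in question.

```javascript
// Hypothetical helper: build a record whose objectID keeps the hash route.
// Assumes `href` still contains the fragment when this runs.
function toRecord(href, fields = {}) {
  const url = new URL(href);
  return {
    // pathname + hash stays unique per hash-routed page
    objectID: url.pathname + url.hash,
    url: url.href,
    ...fields,
  };
}

const records = [
  "https://example.com/subpage.html#/foo",
  "https://example.com/subpage.html#/bar",
].map((href) => toRecord(href));

console.log(records.map((record) => record.objectID));
```

Both records keep separate identifiers even though they share the same path, so neither would overwrite the other in an index.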

@bojanrajh
Contributor Author

Hey @shortcuts, any news on this one?

Somewhat related: I tried to provide anchored URLs to the Crawler with externalData: ['myCSV'], as described in your docs, and those URLs were again collapsed into one.

Example CSV:

url;title;content
"https://example.com/subpage.html#/foo";"Foo";"Foo content"
"https://example.com/subpage.html#/bar";"Bar";"Bar content"

Single URL under Crawler admin > External Data: https://example.com/subpage.html

I expected the same issue to appear with your JS API client, but I've just successfully created 2 objects containing URLs with anchors in our demo app (free plan, app ID BZSKX72NEG). However, I was not able to create an admin API key for our app (DOCSEARCH plan, app ID J1Y01X9HGM) because the "All API Keys" section/tab is missing. Using the Admin API key, I received error 400 - Not enough rights to update an object near line:1.

So, technically, my wild guess is that your system supports anchored URLs and they are just not supported by the crawler?

@bojanrajh bojanrajh changed the title Anchors are being stripped out (using sitemaps and/or linkExtractor) Anchors are being stripped out (using sitemaps, linkExtractor and externalData) Mar 30, 2023
@bojanrajh
Contributor Author

Hey @shortcuts, any news about this one?

@randombeeper randombeeper added the crawler issue related to the indexing label Jul 12, 2024