
Encoded special characters should be decoded by the crawler #1723

Open

tillprochaska opened this issue Dec 20, 2022 · 0 comments

Labels: crawler (issue related to the indexing)
Description

When websites contain encoded special characters (e.g. &amp;amp; instead of &amp;), they aren’t displayed correctly in the DocSearch UI.

Steps to reproduce

  1. Set up a website that uses encoded special characters (e.g. &amp;amp;) in a heading.
  2. Use the docsearch helper in the record extractor to extract contents.
  3. The encoded special characters are indexed as-is.
  4. When rendering the DocSearch UI, the record contents are escaped again, which means that &amp;amp; is displayed rather than &amp;.

You can test this on our documentation site: https://docs.aleph.occrp.org. The sidebar has a "Developers & Admins" section.

This is how it is rendered in the DocSearch UI:

[Screenshot: Screen Shot 2022-12-20 at 15 56 13]

Expected behavior

The Algolia Crawler or the docsearch helper should take care of decoding encoded characters, e.g. by using the respective Cheerio configuration when loading the HTML, or by decoding entities after extracting contents from the HTML.

I’ve adjusted the Crawler configuration to manually decode &amp;amp; for now, but it would be nice if either the Algolia Crawler or the docsearch helper could do this automatically.

recordExtractor: ({ helpers }) => {
  const data = helpers.docsearch({ /* ... */ });

  data.forEach((record) => {
    Object.entries(record.hierarchy)
      // Skip empty hierarchy levels.
      .filter(([level, content]) => content)
      .forEach(([level, content]) => {
        // Use a global regex so every occurrence is replaced,
        // not just the first one.
        record.hierarchy[level] = content.replace(/&amp;amp;/g, "&amp;");
      });
  });

  return data;
}

Environment

  • OS: Any OS
  • Browser: Any browser
  • DocSearch version: 3.3.0
randombeeper added the crawler label on Jul 12, 2024