NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which might require it #845

Open
wants to merge 1 commit into base: master

sebastian-nagel
Contributor

@sebastian-nagel sebastian-nagel commented Dec 4, 2024

  • strip the userinfo from the authority only for HTTP and HTTPS
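The behaviour described above can be sketched roughly as follows. This is a minimal illustration built on `java.net.URI`, not the actual code of this patch: the class name `UserinfoSketch` and the method `normalize` are hypothetical, and the real `BasicURLNormalizer` performs many other normalization steps that are omitted here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch, not the Nutch implementation.
final class UserinfoSketch {

  /** Removes "user:password@" from the authority, but only for http and https. */
  static String normalize(String url) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return url; // leave unparsable URLs untouched in this sketch
    }
    String scheme = uri.getScheme();
    if (scheme == null || uri.getRawUserInfo() == null) {
      return url; // nothing to strip
    }
    String lower = scheme.toLowerCase(Locale.ROOT);
    if (!lower.equals("http") && !lower.equals("https")) {
      return url; // e.g. ftp, sftp: the userinfo may be required, keep it
    }
    // Rebuild the URL from its raw (still percent-encoded) components,
    // leaving out the userinfo.
    StringBuilder sb = new StringBuilder();
    sb.append(scheme).append("://").append(uri.getHost());
    if (uri.getPort() != -1) {
      sb.append(':').append(uri.getPort());
    }
    if (uri.getRawPath() != null) {
      sb.append(uri.getRawPath());
    }
    if (uri.getRawQuery() != null) {
      sb.append('?').append(uri.getRawQuery());
    }
    if (uri.getRawFragment() != null) {
      sb.append('#').append(uri.getRawFragment());
    }
    return sb.toString();
  }
}
```

With this sketch, `http://user:pass@example.com/path?q=1` becomes `http://example.com/path?q=1`, while `ftp://user:pass@ftp.example.com/file.txt` is returned unchanged.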

@sebastian-nagel sebastian-nagel changed the title NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which mi…ght require it NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which might require it Dec 4, 2024
@HiranChaudhuri
Contributor

Does it make sense to decide whether to strip the userinfo part of the authority based on the protocol? I acknowledge that most users want to crawl the internet anonymously. But intranet operators, or users who want to index 'their' own content on local or remote servers, will need the userinfo to be preserved, and they have no control over the protocol. So I suspect it may sometimes be required even when HTTPS is used.

How about making it configurable, maybe via a regexp? That would let Nutch users define the protocol, the site, or ... for which the userinfo is preserved.
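To make the suggestion concrete, here is a rough sketch of what a regexp-driven variant might look like. Everything in it is an assumption for illustration: the class `ConfigurableUserinfoSketch`, its constructor argument, and the choice to match the pattern against the URL with the userinfo already removed (so that credentials never have to appear in the configured expression). No such Nutch property or class exists as far as this thread shows.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical sketch of a configurable rule, not part of Nutch.
final class ConfigurableUserinfoSketch {

  // URLs (without userinfo) matching this pattern keep their userinfo.
  private final Pattern keepUserinfo;

  ConfigurableUserinfoSketch(String keepUserinfoRegex) {
    this.keepUserinfo = Pattern.compile(keepUserinfoRegex);
  }

  String normalize(String url) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return url; // leave unparsable URLs untouched in this sketch
    }
    if (uri.getScheme() == null || uri.getRawUserInfo() == null) {
      return url; // nothing to strip
    }
    String stripped = withoutUserinfo(uri);
    // Keep the original URL if the configured pattern matches,
    // otherwise return the version without userinfo.
    return keepUserinfo.matcher(stripped).find() ? url : stripped;
  }

  // Rebuilds the URL from its raw components, omitting the userinfo.
  private static String withoutUserinfo(URI uri) {
    StringBuilder sb = new StringBuilder();
    sb.append(uri.getScheme()).append("://").append(uri.getHost());
    if (uri.getPort() != -1) {
      sb.append(':').append(uri.getPort());
    }
    if (uri.getRawPath() != null) {
      sb.append(uri.getRawPath());
    }
    if (uri.getRawQuery() != null) {
      sb.append('?').append(uri.getRawQuery());
    }
    if (uri.getRawFragment() != null) {
      sb.append('#').append(uri.getRawFragment());
    }
    return sb.toString();
  }
}
```

For example, a pattern such as `^(?:ftp://|https://intranet\.example\.com(?:[:/]|$))` would keep the userinfo for all FTP URLs and for one intranet host over HTTPS, and strip it everywhere else. Whether the pattern should select by protocol, host, or the full URL is exactly the open design question here.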

2 participants