Crawling Facebook pages creates multiple occurrences of "/unsupportedbrowser" link where there is only one in the page #284

Closed
farjasju opened this issue Sep 10, 2018 · 1 comment

@farjasju
Collaborator

The link facebook.com/unsupportedbrowser appears up to 50 times in a single crawled page, even though the original page contains it only once (in fact, almost every link with more than 2 occurrences is facebook.com/unsupportedbrowser). The redirection process is the suspected cause.
[Image: nodeweights_hist]

@boogheta
Member

I think the issue comes from the redirection resolver pipeline here: https://github.com/medialab/hyphe/blob/master/hyphe_backend/crawler/hcicrawler/pipelines.py#L87
It seems the ResolverAgent sends no User-Agent, nor any other header, so it is worth trying to add one.
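
To illustrate the idea of attaching headers when resolving redirects, here is a minimal sketch. It does not use Hyphe's actual ResolverAgent (which lives in the pipeline linked above); it relies only on Python's standard `urllib.request`. The names `DEFAULT_HEADERS`, `build_resolver_request` and `resolve_redirects` and the User-Agent string are hypothetical, chosen for illustration only.

```python
import urllib.request

# Hypothetical browser-like headers; these values are assumptions,
# not the ones Hyphe uses.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
}


def build_resolver_request(url, headers=None):
    """Build a request that carries explicit headers, including a User-Agent."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)


def resolve_redirects(url, timeout=10):
    """Follow redirects and return the final URL (performs network I/O)."""
    request = build_resolver_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()
```

If the suspicion above is right, a link resolved with such headers would presumably no longer be redirected to facebook.com/unsupportedbrowser, though that would need to be checked against real pages.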

@boogheta boogheta self-assigned this Jun 5, 2019