different results for same search, two years later #496

Open
sofiatipa opened this issue Dec 19, 2023 · 3 comments

Comments

@sofiatipa

Hi,

I repeated a search I first ran nearly two years ago through Hyphe. I am trying to find the co-linkages between two webentities, but the results are quite different: the original search came up with 6 pages that were used by both sites, while the new search shows 3 different pages. Why is that happening? And is there any way to retrieve the original search from your online version?

@boogheta
Member

Hello @sofiatipa, I can only guess, but over two years it seems reasonable that the websites you crawled have changed quite a bit, so it is logical that the results differ today.
You can try using the web archives to retrieve the corpus as it was back then (activate the option from an empty corpus in the Settings tab), but archives are not always complete, so there is no guarantee.

@sofiatipa
Author

sofiatipa commented Dec 19, 2023 via email

@boogheta
Member

boogheta commented Dec 19, 2023

Hello again,

It looks like the Geopolitika.ru website takes quite an aggressive approach towards web crawlers: it refuses most robots through some (quite smart) methods, which apparently also block Web.Archive.org from archiving it (see for instance https://web.archive.org/web/20200417113623/https://www.geopolitika.ru/).

Unfortunately, there is no way to make Hyphe work with this website as of today.

You can, however, go back far enough in time to before they put those measures in place: explore the web archives until you find a functional version, then ask Hyphe to crawl at that date.
To do so, enter the URL of the web archive directly into the IMPORT box of Hyphe.

For instance, I got a crawl working with more than 70 pages visited in 2018 using this URL as the start point: https://web.archive.org/web/20180212120000/https://www.geopolitika.ru
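
If you want to try several dates, the archive URL follows a simple pattern: the `web.archive.org/web/` prefix, a 14-digit `YYYYMMDDhhmmss` timestamp, then the original URL. Here is a minimal Python sketch that builds such start points. The function name `wayback_url` and the constant `WAYBACK_PREFIX` are just illustrative names, not part of Hyphe. The Wayback Machine serves the capture closest to the timestamp you request, so the date does not need to match a capture exactly.

```python
from datetime import datetime

# Public Wayback Machine prefix, as seen in the URLs quoted in this thread.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(target_url: str, when: datetime) -> str:
    """Build a Wayback Machine URL for target_url at the given moment.

    The timestamp segment uses the 14-digit YYYYMMDDhhmmss format.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{target_url}"


if __name__ == "__main__":
    # Rebuilds the start point used for the 2018 crawl mentioned above.
    print(wayback_url("https://www.geopolitika.ru", datetime(2018, 2, 12, 12, 0, 0)))
```

Paste the printed URL into the IMPORT box, and move the date earlier or later if that capture turns out to be blocked or incomplete.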
