
closes #399: on-the-fly calculation of checksum #400

Open
wants to merge 2 commits into
base: master
from

Conversation

cnsgithub

No description provided.

@Chaiavi
Contributor

Chaiavi commented Jan 19, 2020

Why do we need the MD5 checksum of a page's content?

What can it be used for?

@dgoiko

dgoiko commented Jan 26, 2020

> Why do we need the MD5 checksum of a page's content?
>
> What can it be used for?

If a crawler's visit algorithm performs expensive operations on pages and then stores only the extracted information, it may be useful to have a common checksum store where crawlers can check whether an identical page has already been processed and skip it in the future, for instance. The problem is that any page with a non-JS clock, a visit counter, a tiny dynamic PHP banner, or additional whitespace printed God knows why would break the equality.

I'm currently using a more HTML-driven solution: I check a specific tag containing a field that I know can be treated as a "primary key" for the website. However, I have to parse the content into a jsoup Document first, so it is probably more expensive, and it requires me to know the exact layout of the crawled pages (which I do, because I'm crawling information, not just HTML documents).
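
For illustration, the shared checksum store described above could look roughly like the sketch below. This is only an assumption about how such a store might be built: `PageChecksumStore`, `md5Hex` and `markIfNew` are hypothetical names, not part of crawler4j or of this pull request, and the checksum is computed here with the JDK's `MessageDigest` rather than by whatever the PR does on the fly.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not a crawler4j class): remembers the checksums
// of pages that have already been processed.
class PageChecksumStore {
    // Thread-safe set, since several crawler threads may share the store.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns the hex-encoded MD5 digest of the raw page bytes. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder(32);
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true the first time this exact content is seen, false afterwards. */
    boolean markIfNew(byte[] content) {
        return seen.add(md5Hex(content));
    }

    public static void main(String[] args) {
        PageChecksumStore store = new PageChecksumStore();
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        byte[] padded = "<html><body>hello</body></html> ".getBytes(StandardCharsets.UTF_8);

        System.out.println(store.markIfNew(page));   // first visit: worth processing
        System.out.println(store.markIfNew(page));   // identical bytes: can be skipped
        System.out.println(store.markIfNew(padded)); // one extra space: treated as a new page
    }
}
```

The last call shows the weakness mentioned above: a single extra space changes the checksum, so the page no longer counts as identical.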


3 participants