Problem with empty content redirection from BNF errored crawls #426

Closed
boogheta opened this issue Oct 28, 2021 · 2 comments

@boogheta

2021-10-28 11:29:12 [pages] DEBUG: Crawling on Web Archive using for prefix http://archivesinternet.bnf.fr/20210101120000/between 20201218000000 and 20210115235959
2021-10-28 11:29:12 [pages] INFO: Using proxy archivesinternet.bnf.fr:8090
2021-10-28 11:29:12 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/20210101120000/http://www.elysee.fr> (referer: None)
2021-10-28 11:29:15 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.elysee.fr&time=20210101120000> (referer: http://archivesinternet.bnf.fr/20210101120000/http://www.elysee.fr)
2021-10-28 11:29:16 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.elysee.fr> (referer: http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.elysee.fr&time=20210101120000)
2021-10-28 11:29:16 [pages] DEBUG: Filtered duplicate request: <GET http://www.elysee.fr/> - no more duplicates will be shown (see DUPEFILTER_CLASS)

Archive requested: 01/01/21 12:00
Archive contains:

  • 11h07: empty, so it redirects to the same page and the Wayback shifts to the next available capture
  • 15h36: empty, so it redirects to the same page and the Wayback shifts to the next available capture
  • 19h07: ok

The manual workaround is simply to shift the requested date by 1 day, which yields a functioning archive. But there is no way for a user to guess that!

Since the redirection to the next capture is unannounced and handled only by the Wayback through a query, we cannot guess which capture to test. And since we force requests to always stay close to our desired date, we end up requesting the empty page again, which gets filtered as a duplicate.

A possible solution might be to:
catch more cases of redirection at line 234 when using BNF archives, by checking whether the HTTP code is between 300 and 400 and the response is empty; in those cases, run a new request with an extra argument telling the _request function to remove the BNF prefix for this URL in the next request, thereby benefiting from the Wayback's intelligence
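
The check proposed above could look roughly like the sketch below. It is only an illustration, not the crawler's actual code: `is_empty_redirect` and `strip_archive_prefix` are hypothetical helper names, and the prefix pattern is inferred from the URLs in the log above.

```python
import re

# Prefix format as seen in the log: archive host followed by a 14-digit timestamp.
# (Assumption: every prefixed URL follows this shape.)
ARCHIVE_PREFIX_RE = re.compile(r"^https?://archivesinternet\.bnf\.fr/\d{14}/")


def is_empty_redirect(status, body):
    """Return True for a 3xx response that carries no content."""
    return 300 <= status < 400 and not body.strip()


def strip_archive_prefix(url):
    """Remove the BNF archive prefix so the next request targets the bare URL."""
    return ARCHIVE_PREFIX_RE.sub("", url, count=1)


# Hypothetical use inside a response handler:
status, body = 302, b""
url = "http://archivesinternet.bnf.fr/20210101120000/http://www.elysee.fr"
if is_empty_redirect(status, body):
    next_url = strip_archive_prefix(url)  # "http://www.elysee.fr"
```

The idea is that the retried request no longer pins the timestamp, so the Wayback is free to pick the capture it redirects to.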

Other potential ideas:

  • enable automatically following redirections only for BNF archives? (sounds like a bad idea...)
  • use actual datetimes instead of only dates when crawling, which would at least allow manually using a proper permalink (such as) from the Wayback (sounds nice but probably a mess)
  • just identify this kind of error and make the info available in the frontend's report so that the user knows to retry with a different date
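
For the datetime idea, a minimal sketch of building a timestamped archive URL is shown below. The function name `archive_url` is hypothetical, and the URL layout (host, 14-digit timestamp, original URL) is inferred from the log lines above, not from the crawler's code.

```python
from datetime import datetime

ARCHIVE_HOST = "http://archivesinternet.bnf.fr"  # host as it appears in the logs


def archive_url(url, when):
    """Build an archive URL of the form <host>/<YYYYMMDDhhmmss>/<url>."""
    return "%s/%s/%s" % (ARCHIVE_HOST, when.strftime("%Y%m%d%H%M%S"), url)


# Illustrative only: a full datetime instead of the fixed 12:00 used today.
# The seconds are not known from the issue and default to 00 here.
example = archive_url("http://www.elysee.fr", datetime(2021, 1, 1, 19, 7))
print(example)
```

Whether such a URL is a valid permalink depends on the capture actually existing at that exact time, which this sketch does not check.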

boogheta added the bug label Oct 28, 2021
boogheta self-assigned this Oct 28, 2021

boogheta commented Nov 3, 2021

Another example with http/https redirections:

[Screenshots: bnf, bnf2]

2021-11-02 14:46:32 [pages] DEBUG: Crawling on Web Archive using for prefix http://archivesinternet.bnf.fr/20211024120000/between 20090427120000 and 20340422115959
2021-11-02 14:46:32 [pages] INFO: Using proxy archivesinternet.bnf.fr:8090
2021-11-02 14:46:32 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/20211024120000/http://www.faire-du-theatre.fr> (referer: None)
2021-11-02 14:46:35 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.faire-du-theatre.fr&time=20211024120000> (referer: http://archivesinternet.bnf.fr/20211024120000/http://www.faire-du-theatre.fr)
2021-11-02 14:46:35 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.faire-du-theatre.fr> (referer: http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.faire-du-theatre.fr&time=20211024120000)
2021-11-02 14:46:35 [pages] ERROR: Skipping archive page (http://www.faire-du-theatre.fr) within which BNF banner could not be found.
2021-11-02 14:46:35 [scrapy.core.engine] INFO: Closing spider (finished)


boogheta commented Nov 10, 2021

New log with the latest commits:

2021-11-04 10:17:42 [pages] DEBUG: Crawling on Web Archive using for prefix http://archivesinternet.bnf.fr/20211024120000/between 20090427120000 and 20340422115959
2021-11-04 10:17:42 [pages] INFO: Using proxy archivesinternet.bnf.fr:8090
2021-11-04 10:17:42 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/20211024120000/http://www.faire-du-theatre.fr> (referer: None)
2021-11-04 10:17:44 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.faire-du-theatre.fr&time=20211024120000> (referer: http://archivesinternet.bnf.fr/20211024120000/http://www.faire-du-theatre.fr)
2021-11-04 10:17:47 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.faire-du-theatre.fr> (referer: http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.faire-du-theatre.fr&time=20211024120000)
2021-11-04 10:17:47 [pages] DEBUG: Filtered duplicate request: <GET http://www.faire-du-theatre.fr/> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2021-11-04 10:17:47 [scrapy.core.engine] INFO: Closing spider (finished)

boogheta added a commit that referenced this issue Nov 10, 2021